GlobusWorld 2020 Keynote

This presentation was given at the GlobusWorld 2020 Virtual Conference, by Ian Foster, Rachana Ananthakrishnan, and Vas Vasiliadis from the University of Chicago.

GlobusWorld 2020 Keynote

  1. 1. #globusworld @globus
  2. 2. New to Globus? Join us for Globus 101++ tomorrow @ 11am–2:30pm CDT globusworld.org/program
  3. 3. GlobusWorld 2020 Sponsors
  4. 4. A Decade of Enabling Science. Ian Foster, Rachana Ananthakrishnan, Vas Vasiliadis. April 29, 2020
  5. 5. Our Mission: Increase the efficiency and effectiveness of researchers engaged in data-driven science and scholarship through sustainable software
  6. 6. 7
  7. 7. Our mission today
  8. 8. Supporting the nCoV collaboration. Chemical databases: cureFFI, GDB, MOSES, ZINC15, and more… Pipeline steps shown: canonicalization, compute features, fingerprinting, generate images, ML-based filtering, DNN filtering, similarity search. Computing resources. 2019-ncovgroup.github.io
  9. 9. First release: 21 sources, 3.9B molecules, 80 TB computed features 2019-ncovgroup.github.io
  10. 10. ENAMINE REAL: 1.2 billion molecules which comply with “rule of 5” and Veber criteria: MW≤500, SlogP≤5, HBA≤10, HBD≤5, rotatable bonds≤10, TPSA≤140. 21 sources, 3.9B molecules, 80 TB computed features 2019-ncovgroup.github.io
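
The thresholds on this slide can be read as a simple pass/fail filter over precomputed molecular descriptors. The sketch below is only an illustration of that reading: the `Descriptors` container, the function names, and the two example molecules (with made-up descriptor values) are assumptions, not part of the nCoV pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mw: float              # molecular weight
    slogp: float           # SlogP
    hba: int               # hydrogen-bond acceptors
    hbd: int               # hydrogen-bond donors
    rotatable_bonds: int   # number of rotatable bonds
    tpsa: float            # topological polar surface area


# Upper bounds copied from the slide (all are "less than or equal to").
LIMITS = {
    "mw": 500,
    "slogp": 5,
    "hba": 10,
    "hbd": 5,
    "rotatable_bonds": 10,
    "tpsa": 140,
}


def violations(d):
    """Return the names of the descriptors that exceed their limit."""
    return [name for name, limit in LIMITS.items() if getattr(d, name) > limit]


def passes_filter(d):
    """True when the molecule satisfies every criterion on the slide."""
    return not violations(d)


# Illustrative descriptor values only, not measured data.
small_example = Descriptors(mw=180.2, slogp=1.3, hba=3, hbd=1,
                            rotatable_bonds=2, tpsa=63.6)
large_example = Descriptors(mw=720.0, slogp=6.1, hba=8, hbd=2,
                            rotatable_bonds=12, tpsa=120.0)

small_ok = passes_filter(small_example)
large_violations = violations(large_example)
```

Returning the list of violated criteria, rather than a bare boolean, makes it easy to report why a molecule was excluded.
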
  11. 11. 12 xxxxxxxxxxxx
  12. 12. 13 xxxxxxxxxxxx Katrin Heitmann
  13. 13. Google Cloud upload: 5 GB. Google Cloud download: 5 GB
  14. 14. Charting future missions
  15. 15. Globus Labs Mission: To make research data reliably, rapidly, and securely accessible, discoverable, and usable. By… developing an automated and scalable platform for reproducible research that can exploit heterogeneous resources that span the computing continuum. Diagram labels: ƒuncX, Model registry, Flows, Cost map, Write programs, Function fabric, Data/Trust fabric, Automate, DLHub, Globus, SCRIMP, Metadata Extraction, Xtract
  16. 16. funcX: distributed function as a service. Portable code: Python; Docker, Shifter, Singularity. Any computer: clusters, clouds, HPC, accelerators. Any access: cloud API, cluster or HPC scheduler
  17. 17. funcX: Transform clouds, clusters, and supercomputers into high-performance function serving systems. Simply deploy a funcX endpoint to transform a computer into a function serving system. Diagram labels: repo2docker, Register; functions f(x), g(x), h(x), k(x) + dependencies; endpoints EP(x)
  18. 18. funcX: Transform clouds, clusters, and supercomputers into high-performance function serving systems. Simply deploy a funcX endpoint to transform a computer into a function serving system. Registration: f(x), g(x), … + dependencies; EP(x); registry. Execution: f(x), … [1, 2, 3 … n]. Diagram labels: repo2docker, Register; functions f(x), g(x), h(x), k(x); endpoints EP(x)
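
The registration and execution steps on this slide can be mimicked in a single process. The sketch below is a toy model under stated assumptions, not the funcX SDK: `ToyRegistry` and `ToyEndpoint` are hypothetical classes, a thread pool stands in for a remote endpoint, and container packaging of dependencies is left out entirely.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class ToyRegistry:
    """Maps function ids to callables (stand-in for a function registry)."""

    def __init__(self):
        self._functions = {}

    def register(self, func):
        # Registration hands back an opaque id that callers use later.
        function_id = str(uuid.uuid4())
        self._functions[function_id] = func
        return function_id

    def lookup(self, function_id):
        return self._functions[function_id]


class ToyEndpoint:
    """Runs registered functions on a local thread pool (stand-in for EP(x))."""

    def __init__(self, registry, workers=4):
        self._registry = registry
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, function_id, *args):
        # Returns a future so the caller is not blocked while the task runs.
        return self._pool.submit(self._registry.lookup(function_id), *args)

    def map(self, function_id, inputs):
        # Execute the same function over a batch of inputs, keeping input order.
        futures = [self.submit(function_id, x) for x in inputs]
        return [future.result() for future in futures]

    def shutdown(self):
        self._pool.shutdown()


def f(x):
    return x * x


registry = ToyRegistry()
endpoint = ToyEndpoint(registry)
f_id = registry.register(f)           # Registration: f(x)
squares = endpoint.map(f_id, [1, 2, 3])  # Execution: f(x) over [1, 2, 3]
endpoint.shutdown()
```

Separating the registry from the endpoint mirrors the slide's split between registering a function once and executing it many times on whichever endpoint is available.
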
  19. 19. Parsl: parallel programming in Python. arxiv.org/pdf/1905.02158 parsl-project.org
  20. 20. Cost-aware computing with heterogeneous platforms. Incremental construction of a personalized cost map • Build black-box performance models from observed execution times for different codes on different platforms • Transfer learning across codes, problem sizes, and hardware platforms • Experiment design to choose experiments that maximize reduction in uncertainty • Evolve models over time as codes and platforms change • Use models for instance selection and scheduling
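
A much-simplified version of the cost-map loop above can be sketched with plain sample statistics. This is an assumption-heavy stand-in: the talk describes black-box models and transfer learning, whereas the hypothetical `CostMap` class below only keeps per-(code, platform) runtimes, uses the standard error of the mean as its uncertainty, and picks the next experiment where that uncertainty is largest. Platform names and prices are invented for illustration.

```python
import math
from collections import defaultdict


class CostMap:
    """Toy cost map built from observed execution times."""

    def __init__(self):
        self._runs = defaultdict(list)

    def observe(self, code, platform, seconds):
        """Record one observed execution time."""
        self._runs[(code, platform)].append(seconds)

    def predict(self, code, platform):
        """Return (mean runtime, standard error); unknown pairs are maximally uncertain."""
        runs = self._runs.get((code, platform), [])
        if not runs:
            return (math.nan, math.inf)
        mean = sum(runs) / len(runs)
        if len(runs) < 2:
            return (mean, math.inf)
        variance = sum((r - mean) ** 2 for r in runs) / (len(runs) - 1)
        return (mean, math.sqrt(variance / len(runs)))

    def next_experiment(self, code, platforms):
        """Pick the platform whose runtime estimate is most uncertain."""
        return max(platforms, key=lambda p: self.predict(code, p)[1])

    def cheapest(self, code, price_per_hour):
        """Pick the platform with the lowest predicted dollar cost per run."""
        def cost(platform):
            mean, _ = self.predict(code, platform)
            if math.isnan(mean):
                return math.inf
            return mean / 3600 * price_per_hour[platform]
        return min(price_per_hour, key=cost)


# Hypothetical platforms, runtimes (seconds), and prices (dollars per hour).
cost_map = CostMap()
cost_map.observe("IndexBam", "compute_opt", 100)
cost_map.observe("IndexBam", "compute_opt", 110)
cost_map.observe("IndexBam", "general", 150)
cost_map.observe("IndexBam", "general", 170)

platforms = ["compute_opt", "general", "memory_opt"]
prices = {"compute_opt": 0.09, "general": 0.10, "memory_opt": 0.13}

to_try_next = cost_map.next_experiment("IndexBam", platforms)
best_choice = cost_map.cheapest("IndexBam", prices)
```

Here the untested platform is chosen for the next experiment, while instance selection falls back to the platforms that already have measurements.
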
  21. 21. Example: A cost map for bioinformatics applications on different AWS instance types (plot axes: virtual CPUs, RAM (GB)). IndexBam performs better on compute-optimized instances. Poorly chosen experiments mislead the model. On average, within 30% of final error after 4 experiments and within 2.3% after 6
  22. 22. Metadata extraction at the edge • Dynamic extraction pipelines composed of many independent extractors – Metadata and content (images, text, tables, maps, …) • Centralized vs edge extractor execution to weigh tradeoffs between compute and transfer costs
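
The centralized-versus-edge tradeoff can be made concrete with a back-of-the-envelope time model. The sketch below assumes a deliberately simple model that is not taken from the talk: running at the edge pays the edge extraction time plus shipping only the metadata, while running centrally pays for shipping the whole file plus the central extraction time. All function names, parameters, and numbers are hypothetical.

```python
def extraction_times(file_bytes, metadata_bytes, bandwidth_bytes_per_s,
                     edge_extract_s, central_extract_s):
    """Estimate end-to-end seconds for edge and centralized extraction."""
    # Edge: extract where the data lives, then move only the (small) metadata.
    edge = edge_extract_s + metadata_bytes / bandwidth_bytes_per_s
    # Central: move the whole file first, then extract on the central resource.
    central = file_bytes / bandwidth_bytes_per_s + central_extract_s
    return {"edge": edge, "central": central}


def choose_site(**kwargs):
    """Return 'edge' or 'central', whichever has the lower estimated time."""
    times = extraction_times(**kwargs)
    return min(times, key=times.get)


# Hypothetical numbers: a slow edge machine and a 100 MB/s link.
common = dict(metadata_bytes=1e6, bandwidth_bytes_per_s=1e8,
              edge_extract_s=300, central_extract_s=60)

choice_10gb = choose_site(file_bytes=10e9, **common)
choice_100gb = choose_site(file_bytes=100e9, **common)
```

With these made-up numbers the smaller file is cheaper to move and extract centrally, while the larger file tips the balance toward extracting at the edge.
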
  23. 23. DLHub: model publication and serving. dlhub.org arxiv.org/abs/1811.11213
  24. 24. NIH Common Fund Data Ecosystem. Assets: RNAseq, variants, patient phenotypes, expression profiles to small molecules. At multiple sites: managed/hosted by specialists. Goals: increase discoverability; combine, reuse, share assets; increase analysis, enabling clinical research. Data automation: Data, Ingest, Index, Search, Analyze
  25. 25. Product updates
  26. 26. Simplifying the Globus Connect Personal Experience • Option to log in from the application during installation • Setup key method available for automation use cases • Available next week
  27. 27. Simplifying the Globus Connect Personal Experience
  28. 28. Simplifying the Globus Connect Personal Experience
  29. 29. The new Globus Connect v5 architecture provides numerous new features for users and administrators, and serves as a platform for richer data management capabilities.
  30. 30. For users and developers • Web addressable storage system in addition to bulk data access • Credential management for cloud storage systems • No re-authentication needed for duration of tasks • Eliminate user certificates and move to OAuth tokens • …
  31. 31. For administrators • Single DTN pool connects multiple storage systems • Eliminate need for shared file system across DTNs • Complete backup and recovery solution • Configuration management API • …
  32. 32. Next point release GCSv5.4 • Targeted for May 2020 • Deployments with multiple DTNs • Support both standard data access and high assurance access • Custom mapping from user identity (user@domain.edu) to local account • Role-based management for GCS • Guest collection root selection via browse • Connectors supported: – POSIX, Google Drive, Google Cloud, Box, Ceph, AWS S3, SpectraLogic Black Pearl
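
The "custom mapping from user identity to local account" bullet can be illustrated with a small function. This is a sketch of the idea only: it does not use the Globus Connect Server configuration format, and the `map_identity` function, its parameters, and the override table are all hypothetical.

```python
import re

# Accepts identities of the form user@domain, as in the slide's user@domain.edu.
_IDENTITY_PATTERN = re.compile(r"([A-Za-z0-9._-]+)@([A-Za-z0-9.-]+)")


def map_identity(identity, allowed_domain, overrides=None):
    """Map a federated identity to a local account name, or None if not allowed.

    Explicit overrides win; otherwise identities from the allowed domain map
    to the lower-cased part before the '@'.
    """
    overrides = overrides or {}
    if identity in overrides:
        return overrides[identity]
    match = _IDENTITY_PATTERN.fullmatch(identity)
    if match is None:
        return None
    user, domain = match.groups()
    if domain.lower() != allowed_domain.lower():
        return None
    return user.lower()


default_account = map_identity("user@domain.edu", "domain.edu")
rejected = map_identity("user@other.org", "domain.edu")
custom_account = map_identity("pi@domain.edu", "domain.edu",
                              overrides={"pi@domain.edu": "labadmin"})
```

Returning `None` for unmapped identities keeps the default closed: an identity with no rule gets no local account.
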
  33. 33. GCSv5 - Multiple DTNs architecture
  34. 34. Globus Connect Server v5 • Continue to add features as point releases • Migration tools from v4 to v5
  35. 35. Access Google Cloud Storage and other on-prem/cloud storage via the same familiar interface
  36. 36. Data-appropriate storage Google Drive for project admin files Google Cloud Storage for core research files
  37. 37. Fire-and-forget transfers to Google storage resources, e.g. automatic retry on errors
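
"Automatic retry on errors" is the kind of behavior a caller would otherwise have to write by hand. The sketch below shows one common way to do that, retry with exponential backoff; it illustrates the general technique and makes no claim about how the Globus service schedules its retries. `run_with_retry` and `flaky_upload` are hypothetical names.

```python
import time


def run_with_retry(task, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds, doubling the wait after each failure.

    Only OSError (which includes ConnectionError and TimeoutError) is treated
    as transient; any other exception propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Demo: a task that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "done"


delays = []  # record the waits instead of actually sleeping
result = run_with_retry(flaky_upload, sleep=delays.append)
```

Passing the sleep function as a parameter keeps the demo instant and makes the backoff schedule easy to inspect.
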
  38. 38. Maximize value of your Google cloud investment, including sharing data with collaborators
  39. 39. Continue to grow S3 compatible systems
  40. 40. Globus Connectors ActiveScale Object Storage
  41. 41. Growing the connector ecosystem
  42. 42. Other product updates • For users: Several new features in web app – Consolidated view options, HTTPS upload/download via browser, custom message on access, accessibility improvements… • For admins: Transfer updates for checksum handling – Support for additional algorithms (SHA1, SHA256, SHA512), custom checksum value to verify file integrity • For developers: Globus Groups platform service – First release with minimal features to get group membership information
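
Verifying a file against a checksum with one of the algorithms named above can be sketched with Python's standard `hashlib`. This shows the general technique rather than how Globus Transfer computes checksums internally; the helper names and the temporary demo file are assumptions.

```python
import hashlib
import hmac
import os
import tempfile

SUPPORTED = {"sha1", "sha256", "sha512"}  # algorithms named on the slide


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    if algorithm not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex, algorithm="sha256"):
    """Compare the file's digest with an expected hex value."""
    actual = file_checksum(path, algorithm)
    return hmac.compare_digest(actual, expected_hex.lower())


# Demo on a throwaway file containing the three bytes b"abc".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_path = tmp.name

sha256_ok = verify(
    demo_path,
    "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
)
sha1_digest = file_checksum(demo_path, "sha1")
tampered_ok = verify(demo_path, "00" * 32)
os.remove(demo_path)
```
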
  43. 43. Some of the research we’re enabling…
  44. 44. NIH data access at scale for cancer researchers
  45. 45. DataCite switches to Globus Auth for authentication • Globus Auth to secure their Profiles services • Brings federated login to DataCite users • Ongoing collaboration to use Globus Auth for securing other APIs • Globus to use DataCite for persistent identifiers. blog.datacite.org/globus-authentication
  46. 46. Cancer Registry Records for Research (CR3) • Vision: enable broad, controlled access to cancer patient data • Solution: Build a network of federated cancer registries – Self-service data exploration across registries – Secure, auditable access controls for data sharing • Federation via Globus: network scale, local control – Owners input/export data, apply QC, set access policies – Registry data remain at generating institution – Identities provided/authenticated by the institution
  47. 47. CR3 Architecture. CR3 Discovery Portal: researcher logs in with UPMC/Pitt credentials; authentication initiated to Globus Auth (UPMC/Pitt identity providers). Cohort search initiated to Globus Search over the de-identified data index (minimal criteria data: e.g., staging); cohort aggregate counts returned. Request service: registry staff manage authorization. Data transfer from registrar to researcher mediated by Globus (Transfer) from the cancer registry
  48. 48. Programmatic adoption of Globus. “…over 60 research groups …moving over 2PB of data off aging near-line storage…” “Globus sharing and group functionality have also eased the thorny issue of sharing access with remote collaborators in a more controlled manner.” www.technology.pitt.edu/blog/globus
  49. 49. Instrument data delivery at scale Use Globus to deliver 100s of TB of genomic data to researchers Credits: Joe George, University of Michigan
  50. 50. Simplified data sharing for ALCF users. Argonne Leadership Computing Facility (ALCF) “Eagle” provides a 50 PB community file system to make data-sharing easier than ever among ALCF users, their collaborators and with third parties. Diagram labels: Eagle Community File System, Globus sharing
  51. 51. Looking ahead…
  52. 52. Current service enhancements • MFA policy for data access • IPv6 support • Conditional fault handling • Enhancements for storage with staging requirements • Enhancements to application registration and management • Groups service – Membership API – Management API
  53. 53. Platform Challenge. Transform how research applications and services are… created, used, and delivered; orchestrated to achieve automation; sustained. Enable an interoperable ecosystem of research applications and services
  54. 54. Globus platform services • Identity and Access Management (IAM) – Auth – Groups • Data Services – Connect – Transfer – Manifest • Search • Identifiers (collaboration with DataCite) • Flows
  55. 55. Globus Platform: Automation
  56. 56. Automation: Action Providers. Delete, ACLs, Search, DLHub, User Form, Notification, Expression Evaluation, Describe, Web Form, Identifier, Transfer, Ingest, Xtract, funcX. Legend: Globus action providers; custom action providers
  57. 57. Enabling serial crystallography at scale • Serially image chips with thousands of embedded crystals • Quality control first 1,000 to report failures • Analyze batches of images as they are collected • Report statistics and images during experiment • Return crystal structure to scientist Darren Sherrell, Gyorgy Babnigg, Andrzej Joachimiak
  58. 58. SSX Automation. Flow actions (provider: step): funcX: Analyze; Transfer: Return results; Auth: Get credentials; funcX: Preprocess; Threshold: Stop?; Transfer: Transfer data; Publish: Publish results
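
A flow like the one on this slide is a chain of actions with an early exit. The toy runner below is a sketch under stated assumptions: the step order, the stub actions, the `hit_rate` quality measure, and the stop rule are all invented for illustration, and nothing here uses the Globus Automate or Flows APIs.

```python
def get_credentials(state):
    return {**state, "token": "placeholder-token"}   # stand-in for Auth


def transfer_data(state):
    return {**state, "staged": True}                 # stand-in for Transfer


def preprocess(state):
    # Hypothetical quality measure: fraction of images with a usable hit.
    return {**state, "hit_rate": state["hits"] / state["images"]}


def threshold(state):
    # Stop the flow early when quality falls below the configured minimum.
    return {**state, "stop": state["hit_rate"] < state["min_hit_rate"]}


def analyze(state):
    return {**state, "analyzed": True}


def return_results(state):
    return {**state, "returned": True}


def publish(state):
    return {**state, "published": True}


# Assumed ordering of the actions named on the slide.
FLOW = [
    ("Get credentials", get_credentials),
    ("Transfer data", transfer_data),
    ("Preprocess", preprocess),
    ("Threshold", threshold),
    ("Analyze", analyze),
    ("Return results", return_results),
    ("Publish results", publish),
]


def run_flow(flow, state):
    """Run actions in order, passing state along; halt when 'stop' is set."""
    trace = []
    for name, action in flow:
        state = action(state)
        trace.append(name)
        if state.get("stop"):
            break
    return state, trace


good_state, good_trace = run_flow(
    FLOW, {"images": 1000, "hits": 40, "min_hit_rate": 0.02})
bad_state, bad_trace = run_flow(
    FLOW, {"images": 1000, "hits": 5, "min_hit_rate": 0.02})
```

Each action takes and returns a plain dictionary, so swapping one action for another does not change the runner.
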
  59. 59. PaaS: develop custom action providers • Directly use the platform to build and run extensible flows • Develop action providers – Fit for purpose – Developed and deployed by the project – Plugged into their flows • Action Provider Development toolkit
  60. 60. XPCS: X-ray Photon Correlation Spectroscopy. Diagram labels: APS, Lab Server 1, Argonne JLSE, Argonne Leadership Computing Facility, ALCF Data Portal. Numbered steps: 1 Imaging, 2 Acquisition, 3 XPCS-Eigen, 4 Plot results, 5 Publication, 6 Science! ● Automate flows stage data to ALCF for on-demand analysis and publication ● Metadata and plots dynamically extracted, and published into a search catalog ● Scientists can select datasets and initiate flows to perform batch analysis tasks. Suresh Narayanan, Nicholas Schwarz
  61. 61. Automating XPCS (Automate). Flow actions (provider: step): Auth: Get credentials; Transfer: Transfer HDF5; Transfer: Transfer IMM; funcX: Run Corr; funcX: Plot Results; Search: Ingest; Share: Set ACL; Transfer: Return Results
  62. 62. SaaS: instrument data management • Templated solution • Configurable… – Set transfer triggers – Select destination(s) – Define metadata • Extensible… – Add/remove actions – Change action providers • No development required. Diagram labels: instruments (Cryo EM, Lightsheet, Sequencer, …); automated egress from device (--/cohort045, --/cohort096, --/cohort127); image reconstruction, analysis, visualization; indexing for search; Transfer, funcX, Xtract
  63. 63. Materials Data Facility > 40 TB of data > 320 published authors > 400 datasets • Accept data from many locations with flexible interfaces • Index dataset contents in science-aware ways • Dispatch data to the community • Using Automate to simplify building composable flows of services
  64. 64. MDF Data Publication Automation (Automate). Flow actions (provider: step): Ingest: Bulk Ingest; Auth: Get Credentials; Transfer: Transfer Dataset; XTract: Extract Metadata; Share: Set permissions; Transfer: Move metadata; Transfer: Transfer Dataset; Transfers: Transfer Dataset; Identifier: Mint DOI; Web form: Metadata; Notify: Notify Curator; Web form: Curation; Notify: Notify user
  65. 65. SaaS: Data Management Plans • “Turnkey” DMP enablement • Select dataset (collection)… • …add metadata for indexing • …generate persistent ID (DOI, ARK, etc.) “Point & Click” to findable and accessible data. Services shown: Transfer, Identifier, Ingest
  66. 66. Data portals currently leveraging the platform
  67. 67. Sustainability Update
  68. 68. Why subscribe?
  69. 69. To go (way) beyond file transfer… • Remove friction for external collaborators • Automate/scale research data flows • Diversify research storage options, with a unified interface • Gain visibility into research storage utilization • Integrate robust data management into research apps • Optimize data transfer performance • Access expert support resources
  70. 70. To help our community share the load… [Chart: Active Endpoints by Month, 2015/04 to 2019/12, Subscribed vs. Free; y-axis 0 to 6,000]
  71. 71. Thank you, funders... U.S. Department of Energy
  72. 72. Thank you, GlobusWorld sponsors
