Foundations for the Future of Science


Keynote presentation at GlobusWorld 2021. Highlights product updates and roadmap, as well as user success stories in research data management. Presented by Ian Foster, Rachana Ananthakrishnan, Kyle Chard and Vas Vasiliadis.


  1. 1. Foundations for the future of science Ian Foster, Rachana Ananthakrishnan, Kyle Chard, Vas Vasiliadis GlobusWorld - May 12, 2021
  2. 2. Serial Synchrotron Crystallography of SARS-CoV-2 proteins
  3. 3. The COVID-19 data pipeline: HPC, ML, and people developing machine-readable datasets for small-molecule libraries. Pipeline stages: chemical library database and more (4B known molecules), computing resources, canonicalization, compute features, deep learning filtering, fingerprinting, similarity search, generate images, CNN filtering. Yadu Babuji, Ben Blaiszik, Kyle Chard, Ryan Chard, Ian Foster, Logan Ward, Tom Brettin et al.
  4. 4. A National Pandemic Observatory
  5. 5. (Architecture diagram) Data sources flow along an active data path: ingest; annotate, assemble, align, interpolate, normalize; introspect, correct, calibrate, characterize, detect anomalies. An intelligent edge provides adaptive sampling and edge computing, and an experiment engine drives continuous reanalysis, active learning, and simulation. Data products are created, published, cataloged, versioned, and shared (DOI) in a FAIR Data Commons for scientists, the public, and decision makers. 8
  6. 6. The next frontier? “AI for science” “Most of the modeling and prediction necessary to produce the next generation of breakthroughs in science, energy, medicine, and national security will come not from applying traditional theory, but from employing data-driven methods at extreme scale tightly coupled to experiments and scientific user facilities.” — US Department of Energy FY 2021 Congressional Budget Justification
  7. 7. Why are we excited about “AI for science”? Push • Step changes in AI/ML methods, notably deep neural networks • Major advances in areas like machine translation, speech recognition, image processing • New hardware specialized for deep neural networks Pull • Exploding volumes of data due to new sensors and instrumentation exceed human capabilities • End of Moore’s Law puts hard problems out of reach • Growing complexity of science and engineering problems slowing rate of discovery
  8. 8. AI Science Applications: One per Planet — AI-Enabled Design Workflows (what to make), AI-Enabled Experimental Workflows (how to make it), and AI-Enabled Scientific Comprehension (what it means), spanning …materials, polymers, organisms…, …self-driving labs, synthesis search…, data sets, literature, science “news”, and strategy (cleaned, updated, annotated, aggregated, interpreted). Insight?
  9. 9. AI for Science: AI Building Blocks (examples) — Methods: augmented simulations, design, control, science and math comprehension, generative models, inverse problems, multimodal learning, decision-making. Targets: materials, biology, chemistry, devices, batteries, drugs. Data types: waveforms, text, images, structured graphs, time series. Mappings: image → phase, spectra → structures, waveform → source. Applications: detector simulations, cosmology, biodesign, experiments, accelerators, reactors, mobility. Techniques: simulation, energy landscape search, surrogates, optimization. Disciplines: mathematics, physics, biochemistry. Uses: risk assessment, research priorities, the next problem.
  10. 10. Example: Rational design of intrinsically disordered polypeptides (Arvind Ramanathan et al.) — Protein engineering with a liquid-handling robot, robotic pendant drop, and SAXS / SA-XPCS at the APS 8-ID-I beamline, coordinated by a digital twin + AI components. Screen ~10^8 conditions for LLPS; screen ~10^4 combos for LLPS (turbidity, confocal microscope imaging); screen ~10^2 combos at various temperatures (X-ray); HPC simulation computes ~10^5 properties. Selected matrices (e.g., salt, pH, PEG); stock proteins (different periods, repeats); change sample, measure sample. Key: information transfer and control (demonstrated / not yet demonstrated), material transfer (not yet demonstrated). Facilities: ALCF, APCF, APS.
  11. 11. AI for science means rethinking infrastructure 15 — Infrastructure for AI-enabled science. Sources: scientific instruments (major user facilities, laboratories, automated labs, …), sensors (environmental, laboratory, mobile, …), simulation codes (computational results, function memoization, …), databases (reference data, experimental data, computed properties, scientific literature, …), scientists and engineers (expert input, goal setting, …), industry and academia (new methods, open source codes, AI accelerators, …). Artificial intelligence methods: data ingest, inference, HPO, data enhancement, data QA/QC, feature selection, model training, UQ, model reduction, active/reinforcement learning. Agile infrastructure: data, models, accelerators, compute, surrogates. System software: data management, operating system, portability, compilers, runtime system, workflow automation, programming environments, languages, model creation, libraries, resource management, authentication/access.
  12. 12. Diverse impacts across the globe 16
  13. 13. Understanding SARS-CoV-2 Protein Structure 17 “These data services have taken the time to solve a structure from weeks to days and now to hours” Darren Sherrell, SBC beamline scientist APS Sector 19
  14. 14. Data Management at Cryo-EM Facilities 18 Case Western Reserve – Cryo-EM Core Credit: https://case.edu/medicine/research/som-core-facilities/cryo-electron-microscopy-core Credit: https://pncc.labworks.org/about-us Pacific Northwest Cryo-EM Processing Center (PNNL and Oregon Health Sciences University) Globus for – automated data sync as new data is collected – provisioning of data access for researchers – reliable, secure data access for users – monitoring and management via console
  15. 15. The Bioinformatics Core of the Lineberger Comprehensive Cancer Center at the University of North Carolina Global data distribution at bioinformatics core – Multiple research projects use Globus for data sharing with external collaborators – Support wide variety of projects: different locations, sources, sizes, cancer types, institution type, storage systems, and identities
  16. 16. Digital agriculture – University of Winnipeg • Increasing crop yields using machine learning models • Building training data sets – 40K images per day, tagged with metadata – Move data from diverse sources to campus storage, then onto Compute Canada HPC to run models • Orchestrate data transfer using the Globus CLI 20 Credit: Dilbarjot and Michael Beck, Physics and Applied Computer Science, University of Winnipeg
  17. 17. Dark Energy Science Collaboration • Preparation for the arrival of the Rubin Observatory • Data Challenge 2: extreme-scale simulation of a 300-square-degree patch of the sky over five years – 5 TB of data – ~90M core hours at ALCF and NERSC • Data portal based on Globus makes data accessible to collaborators 21
  18. 18. Federated Research Data Repository • National research data management platform, where data can be – Ingested, curated, and preserved – Discovered, cited, and shared • Globus services – Authentication – Transfer to repository service – Search as the metadata catalog for data discovery (includes metadata from 70 other repositories) 22
  19. 19. Rebuilding A Kidney GPCR GUDMAP Synapse FaceBase ● DERIVA is an asset management platform for science used in various biomedical data repositories ● Globus Auth for authentication with external identities ● Globus groups for roles (e.g., curator, viewer, administrator) ● Globus Auth for desktop GUI and CLI DERIVA
  20. 20. 24 Increasing Data Interoperability & Reusability — Foundry. Data publishers and model publishers feed datasets and models/functions through an API layer to consumers: science! • Models run locally or on distributed endpoints • Capabilities to pull datasets to a desired location or move compute to a desired location • Radically reduce the energy barrier to access curated ML datasets and ML models • Facilitate reuse, meta-studies, benchmarking, and more • Long-term implications for education. The client snippet from the slide is shown, cleaned up, below. NSF CSSI, started Oct. 2019 (Dane Morgan, Paul Voyles, Michael Ferris, Marcus Schwarting, Ben Blaiszik)
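A cleaned-up, runnable rendering of the Foundry snippet on the slide; the method names are taken directly from the slide, so treat them as illustrative rather than the exact API of the released foundry package.

    from foundry import Foundry

    f = Foundry()

    # Load a curated, versioned ML dataset
    X, y = f.load("dataset1", v="1.0")

    # Run a published model on the loaded data
    y_pred = f.run("model1", v="1.0", X=X)

    # Publish new versions of the dataset and the model
    f.data.publish("./dataset1", v="1.1")
    f.model.publish("./model1", v="1.1")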
  21. 21. National cyberinfrastructure adoption 25
  22. 22. Enabled by the Globus data platform — 1 Researcher initiates a transfer request, or it is requested automatically by a script or science gateway (instrument, compute facility, personal computer) 2 Globus transfers files reliably, securely 3 Researcher selects files to share, selects user or group, and sets access permissions 4 Globus controls access to shared files on existing storage; no need to move files to cloud storage! 5 Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus 6 The Globus command line interface, APIs, and Python SDK provide a platform… 7 …for building science gateways, portals and publication services 8 Automating research workflows and ensuring those that need access to the data have it. Transfer • Share • Build: use a web browser or platform services, access any storage, use an existing identity. A minimal transfer sketch using the Python SDK follows.
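As one illustration of the platform (step 6), a minimal sketch of submitting a transfer with the globus-sdk Python package; this is not from the slides, and the access token, endpoint UUIDs, and paths are hypothetical placeholders.

    import globus_sdk

    # Assumes a Globus Auth login flow has already yielded a transfer access token
    TRANSFER_TOKEN = "..."  # hypothetical, obtained via Globus Auth

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
    )

    src = "ddb59aef-6d04-11e5-ba46-22000b92c6ec"  # hypothetical source endpoint UUID
    dst = "ddb59af0-6d04-11e5-ba46-22000b92c6ec"  # hypothetical destination endpoint UUID

    # Describe the transfer: checksum-based sync of one directory tree
    tdata = globus_sdk.TransferData(tc, src, dst,
                                    label="instrument run", sync_level="checksum")
    tdata.add_item("/instrument/run0001/", "/project/data/run0001/", recursive=True)

    # Fire-and-forget: Globus manages retries and reports completion
    task = tc.submit_transfer(tdata)
    print("Submitted transfer, task id:", task["task_id"])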
  23. 23. Enabling the next wave 27
  24. 24. Globus Search • Scalable, secure search for research data • Features: – Metadata store with fine-grained visibility controls – Schema agnostic – Free text and faceted search – Integrated with the Globus research platform (Auth, Groups) 28 Typical pattern: extract metadata from input forms, bulk-ingest it with visibility policies set, then query for discovery; the query fragment from the slide is reconstructed below.
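Reconstructed from the fragment on the slide (the index ID 123 and the field names are the slide's illustrative values), a filtered query posted to a search index looks roughly like this:

    POST /index/123

    {
      "filters": [
        {
          "type": "match_all",
          "field_name": "record_year",
          "values": ["2020"]
        },
        {
          "type": "range",
          "field_name": "temp_farenheit",
          "values": [{"from": 90, "to": "*"}]
        }
      ]
    }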
  25. 25. Example: Cosmology 29
  26. 26. Globus Search • Documentation: docs.globus.org/api/search • SDK: globus-sdk-python.readthedocs.io • CLI: pypi.org/project/globus-search-cli • Sample code and walkthrough: docs.globus.org/api/search/guides/searchable_files 30
  27. 27. 31 Globus automation services Managed, secure, and reliable task orchestration across heterogeneous resources, with a declarative language for composition, extensible with plug-in custom actions, and supporting an event-driven execution model for automation at scale
  28. 28. Create and deploy flows 32 • Define the flow and deploy to the Flows service • Uses a declarative language (JSON or YAML) • Set policy: visibility, who can run it (Diagram: example flows composed of sequential actions, including a choice state.) A minimal flow definition is sketched below.
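A minimal sketch of what such a declarative flow definition can look like in JSON, with a single transfer action. The action URL and parameter names follow the Globus-provided transfer action provider as documented around this time, but treat them as illustrative assumptions rather than an exact, current schema.

    {
      "StartAt": "TransferData",
      "States": {
        "TransferData": {
          "Type": "Action",
          "ActionUrl": "https://actions.globus.org/transfer/transfer",
          "Parameters": {
            "source_endpoint_id.$": "$.input.source_endpoint",
            "destination_endpoint_id.$": "$.input.destination_endpoint",
            "transfer_items": [
              {
                "source_path.$": "$.input.source_path",
                "destination_path.$": "$.input.destination_path",
                "recursive": true
              }
            ]
          },
          "ResultPath": "$.TransferResult",
          "End": true
        }
      }
    }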
  29. 29. Start and manage runs 33 • A run is an instance of flow execution – Provide input parameters – Check status – Cancel • Set policy: monitor, manager • Triggers to start flows. A sketch of deploying and running a flow from Python follows.
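A rough sketch of deploying and running a flow with the globus-automate-client Python package; the helper and method names follow that package's documentation from around this time, but exact signatures vary between releases, so treat the details as assumptions.

    from globus_automate_client import create_flows_client

    # A trivial flow definition; in practice use something like the transfer sketch above
    flow_definition = {
        "StartAt": "DoNothing",
        "States": {"DoNothing": {"Type": "Pass", "End": True}},
    }

    fc = create_flows_client()  # prompts for Globus Auth login on first use

    # Deploy the flow; visibility and run policies can also be set at deploy time
    flow = fc.deploy_flow(flow_definition, title="Example flow")
    flow_id = flow["id"]
    flow_scope = flow["globus_auth_scope"]

    # Start a run with input parameters, then check its status
    run = fc.run_flow(flow_id, flow_scope, {"input": {}})
    status = fc.flow_action_status(flow_id, flow_scope, run["action_id"])
    print(status["status"])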
  30. 30. Build action providers 34 • An Action Provider is a service endpoint implementing – Run – Status – Cancel – Release – Resume • Action Provider Toolkit: action-provider-tools.readthedocs.io/en/latest • Globus-provided providers: Search, Transfer, Notification, ACLs, Identifier, Delete, Ingest, Describe, User Form / Web Form; custom-built providers: Xtract, funcX
  31. 31. Automation services ecosystem — Create action providers (GET /provider_url/, POST /provider_url/run, GET /provider_url/action_id/status, GET /provider_url/action_id/cancel, GET /provider_url/action_id/release); define and deploy flows ({ “StartAt”: ”ToProject”, ”States”: { ”ToProject”: { … }, ”SetPermission”: { … }, “ProcessData”: { … }, … } }); run flows. An illustrative action-provider stub follows.
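For illustration only (this is not the Globus Action Provider Toolkit), a tiny Flask stub exposing the routes listed above might look like the following; state is kept in memory, statuses are simplified, and no Globus Auth token validation is performed.

    import uuid
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    actions = {}  # action_id -> state, in memory only

    @app.route("/", methods=["GET"])
    def introspect():
        # Describe this provider (title, input schema, ...)
        return jsonify({"title": "Example Action Provider", "input_schema": {}})

    @app.route("/run", methods=["POST"])
    def run():
        body = request.get_json(force=True)
        action_id = str(uuid.uuid4())
        actions[action_id] = {"status": "ACTIVE", "details": body}
        return jsonify({"action_id": action_id, "status": "ACTIVE"}), 202

    @app.route("/<action_id>/status", methods=["GET"])
    def status(action_id):
        state = actions.get(action_id, {"status": "UNKNOWN"})
        return jsonify({"action_id": action_id, "status": state["status"]})

    @app.route("/<action_id>/cancel", methods=["GET"])
    def cancel(action_id):
        if action_id in actions:
            actions[action_id]["status"] = "FAILED"
        return jsonify({"action_id": action_id, "status": "FAILED"})

    if __name__ == "__main__":
        app.run(port=5000)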
  32. 32. Example: CFDE (Common Fund Data Ecosystem) 36 — Data coordinating centers deposit metadata, which is indexed for discovery in a user data portal. Powered by Globus Auth, Groups & Flows
  33. 33. Example: High-Performance Ptychography Workflows Funding Sources: ASCR, BES
  34. 34. Automation services • Documentation: docs.globus.org/globus-automation-services • CLI: globus-automate-client.readthedocs.io • Python SDK: globus-automate-client.readthedocs.io • Sample flows visible to all users 38
  35. 35. (Re)laying the foundation: GCSv5 39
  36. 36. Globus Connect Server v5 • Feature parity with v4 • Custom DNS names (e.g. data.university.edu) • Multi-factor authentication policy • Enhanced sharing policy • Containerized deployment 40
  37. 37. Fire-and-forget transfers Data sharing with collaborators
  38. 38. Partnership with the community to develop new connectors Community Connector Program
  39. 39. Easy egress and ingress of data Data sharing with collaborators Publish data
  40. 40. POSIX Staging Connector • For POSIX file systems that cache data from tertiary storage • Custom plug-in for staging files • Example: – IBM Spectrum Scale plugin, Brock Palen at University of Michigan - github.com/brockpalen/ltfsee-globus 44
  41. 41. Current connector landscape
  42. 42. Globus Groups • Groups platform in production • Administrators can add users directly, in addition to inviting them • Membership policies simplified groups.api.globus.org/redoc 46
  43. 43. Transfer and Sharing • Skip files with not found errors – List of skipped files once task is completed • Fail tasks with quota errors • Scheduled and replicated transfers – Manage scheduled/repeated transfer and sync tasks – pypi.org/project/globus-timer-cli 47
  44. 44. Leveraging the Globus data platform… 48
  45. 45. APS XPCS: secure data discovery 49 Globus Auth Globus Groups Globus Search
  46. 46. APS XPCS: data access & preview 50 Globus Transfer HTTPS access
  47. 47. APS XPCS: automated processing & indexing 51 Globus Flows: Transfer, analysis, and ingest to search index
  48. 48. 52 The (product) road ahead
  49. 49. Globus Connect • Tools to migrate from v4 to v5 – Migration in phases (Q2 – Q3) – Goal: no end-user intervention required • IPv6 support • Connectors – Azure Blob – Intel DAOS 53
  50. 50. IAM and Data platform • Support use cases that need higher task throughput • Enhancements to data permissions management • Improvements to consent management • Integration with NIH Researcher Auth Service • Search service for high assurance tier • Leverage Search for Globus resources 54
  51. 51. Automation platform • Lower the barrier for adoption – Web interfaces – Supporting tools/libraries – Action Providers for all Globus functionality • Exemplar flows for common use cases – Instrument data management – Data publication • Supported in high assurance tier 55
  52. 52. Clients • Streamline SDK/CLI across services • Web App – Updated management console – Accessibility standards • Enhancements to sample portal – Open source, for customization and deployment – Flask, Django 56
  53. 53. 57 Looking to the future
  54. 54. Building the compute foundation for Globus 58
  55. 55. Requirements for reliable, scalable, remote computing — 1. Compute: researcher needs to run a computation on a remote PC, cloud, or supercomputer 2. Specialize: researcher needs to move it to a new system or architecture to improve performance 3. Share: collaborator wants to run their colleague’s computation on another system closer to their data 4. Community Access: collaborators want to share access to a single allocation to run compute tasks 5. Build: gateway and application developers want to add remote computation to their code
  56. 56. Function as a Service (FaaS) — Developers work in terms of programming functions: 1. Pick a runtime (e.g., Python) 2. Register function code 3. Run (and scale). Low latency, on-demand, elastic scaling, easy to deploy and update 60

    def compute(input_args):
        # do something
        return results
  57. 57. funcX: managed and federated FaaS • Cloud-hosted service for managing compute • Register and share compute endpoints • Register and share Python functions • Reliably, scalably, and securely execute functions on remote endpoints • Integrated with Globus Auth and data ecosystem 61 Try funcX on Binder https://funcx.org
  58. 58. Transform laptops, clusters, clouds into function serving endpoints • Python-based agent and pip installable locally or in Conda • Elastically provisions resources from local, cluster, or cloud system • Manages concurrent execution on provisioned resources • Optionally manages execution in Docker, Singularity, Shifter containers • Share endpoints with collaborators 62 $ pip install funcx-endpoint $ funcx-endpoint configure myep $ funcx-endpoint start myep
  59. 59. Register and share functions 63 • Create a funcX client (and authenticate) • Define and register a Python function:

    def compute(input_args):
        # do something
        return results
  60. 60. Execute tasks on any accessible endpoint 64 • Select: function ID, endpoint ID, and input arguments • Retrieve results asynchronously (funcX stores results in the cloud) (Diagram: tasks F(ep1, 1) … F(ep1, 6), F(ep2, 7) dispatched across endpoints.) A minimal client-side sketch follows.
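A minimal sketch of this client-side pattern using the funcx Python SDK from around the time of this talk; the endpoint UUID is a hypothetical placeholder and exact method signatures may differ between funcX releases.

    import time
    from funcx.sdk.client import FuncXClient

    def compute(input_args):
        # do something
        return sum(input_args)

    fxc = FuncXClient()  # prompts for Globus Auth login on first use

    # Register the function with the funcX service (returns a function UUID)
    func_id = fxc.register_function(compute, description="Example compute function")

    # Submit a task to an endpoint you can access (UUID below is hypothetical)
    endpoint_id = "4b116d3c-1703-4f8f-9f6f-39921e5864df"
    task_id = fxc.run([1, 2, 3], endpoint_id=endpoint_id, function_id=func_id)

    # Results are stored by the service and retrieved asynchronously; poll until ready
    while True:
        try:
            result = fxc.get_result(task_id)
            break
        except Exception:
            time.sleep(2)  # task still pending
    print(result)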
  61. 61. https://funcx.org https://mybinder.org/v2/gh/funcx-faas/examples/HEAD
  62. 62. Canonical research automation flow for instruments 69 Stages: data capture; data staging; metadata extraction and data cataloging; data analysis / model in the loop; catalog; publication; feedback to data generation. Examples • Serial X-ray crystallography • X-ray photon correlation spectroscopy • High-energy diffraction microscopy • High-throughput ptychography • High-energy X-ray diffraction
  63. 63. Applying the Globus platform to science at the APS 70 (Diagram) Key: funcX agent, Globus Connect. Advanced Photon Source (APS computing: APS DM system, Orthros cluster, portal server); Argonne Leadership Computing Facility (Theta, Cooley, Petrel store, portal server); Laboratory Computing Research Center (Bebop cluster); orchestrated as a sequence of flow actions (Action 1 … Action 4).
  64. 64. Example: Rapid Training of Deep Neural Networks using Remote Resources • DNN at the edge for fast processing, filtering, QC • Requires tight coupling with simulation and training with real-time data • Globus Flow: 71 Data Source HPC/DCAI Edge(Host) Globus, Automate Commands Status Data Model User Request Status Commands Status C/S Zhengchun Liu, Jana Thayar, et al. – Globus to rapidly move data for training – funcX for simulation and model training – Globus to move models to the edge – (Future) funcX for inference at the edge
  65. 65. Making this possible 72
  66. 66. Our Mission Increase the efficiency and effectiveness of researchers engaged in data-driven science and scholarship through sustainable software
  67. 67. 74 Active endpoints in over 70 countries
  68. 68. Adoption among R1 Institutions: 126 of 130 use Globus So, how are we doing? 75
  69. 69. Adoption among U.S. national laboratories So, how are we doing? 76
  70. 70. Notables… 77 BIG Movers 66.3 PB 2 Share 1,593 💛 Frequent Movers 887,000
  71. 71. Thank you, funders... U.S. Department of Energy
  72. 72. Thank you to our Platinum sponsor!
  73. 73. Thank you Gold sponsors!
  74. 74. Thank you Gold sponsors!
  75. 75. Thank you Gold sponsors!
  76. 76. Thank you Gold sponsors!
  77. 77. Thank you Gold sponsors!
  78. 78. Thank you Gold sponsors!
  79. 79. Thank you Patron sponsors!
  80. 80. A word from our Platinum Sponsor Jordan Winkelman, Field Solutions CTO 89
  81. 81. Join us in Gather.Town • Get answers at the Globus Genius Bar • Visit the Sponsor Showcase • Join the scavenger hunt in The Garden • Play a game bit.ly/globustown (passcode: globus) 90
  82. 82. #globusworld @globus
