Scaling collaborative data science with Globus and Jupyter

The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.


  1. 1. Ian Foster foster@uchicago.edu JupyterCon, New York, August 23, 2018 Scaling collaborative data science with Globus and Jupyter
  2. 2. Andre Schleife, UIUC: Modeling stopping power with time-dependent density functional theory. Hydrogen in Gold, v=2.0. 16,000 CPU-hours per simulation. Sample; Experimental scattering; Material composition; Simulated structure; Simulated scattering; La 60%, Sr 40%; Evolutionary optimization. 786,432 CPUs, 10 PFLOPS (10^16 flops) supercomputer, Argonne Leadership Computing Facility
  3. 3. @python_app Logan Ward
  4. 4. But data are big and distributed, and our science is collaborative (1) Query (2) Transfer (3) Learn materialsdatafacility.org petrel.alcf.anl.gov Cooley: 290 TeraFLOPS (4) Share 2 PB, 80 Gbps Globus-enabled store 3.2M materials data We need multi-credential, multi-service authentication and big data management
  5. 5. operated by UChicago for researchers worldwide Automate globus.org Globus services
  6. 6. • Multi-user Hub • Configurable HTTP proxy • Multiple single-user Jupyter notebook servers Recall: JupyterHub components Hub Configurable HTTP proxy Authenticator User database Spawner Notebook /api/auth Browser /hub/ /user/[name]/
  7. 7. • Multi-user Hub • Configurable HTTP proxy • Multiple single-user Jupyter notebook servers Recall: JupyterHub components We want to grant notebooks access to the world • Tokens for remote services • APIs for remote actions: e.g., Globus data management Hub Configurable HTTP proxy Authenticator User database Spawner Notebook /api/auth Browser /hub/ /user/[name]/ Cooley Globus Petrel
  8. 8. Securing JupyterHub with Globus Auth We provide a simple Globus OAuth plugin • 100s of identity providers (can restrict which ones) • 1000 registered clients, apps • Custom scopes • Tokens passed into notebook environment JupyterHub OAuthenticator
  9. 9. Use within JupyterHub is easy https://github.com/jupyterhub/oauthenticator#globus-setup
  10. 10. Tokens are easily used within notebooks Login REST APIs {“tokens”:… {“tokens”:… REST APIs REST APIs Bearer a45cd… Globus Transfer Globus Search Globus Publish Your App Another App Hub Configurable HTTP proxy Authenticator User database Notebook /hub/ /user/[name]/ Spawner /api/auth Browser
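
The "Bearer a45cd…" flow on this slide can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the demo's actual code: the layout of the `tokens` dictionary (keyed by resource server) and the helper names are assumptions made for the example, and the request is only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Base URL of the Globus Transfer REST API.
TRANSFER_BASE = "https://transfer.api.globus.org/v0.10"


def bearer_request(url, access_token):
    """Build (but do not send) a GET request carrying an OAuth2 bearer token."""
    return Request(url, headers={"Authorization": f"Bearer {access_token}"})


def endpoint_search_request(tokens, text):
    """Build an endpoint search request from a token dictionary.

    ASSUMPTION: `tokens` maps a resource server name to a dict holding an
    "access_token" entry. How the Hub hands tokens to the notebook depends
    on the deployment; adapt the lookup to your environment.
    """
    token = tokens["transfer.api.globus.org"]["access_token"]
    query = urlencode({"filter_fulltext": text})
    return bearer_request(f"{TRANSFER_BASE}/endpoint_search?{query}", token)
```

Passing the returned object to `urllib.request.urlopen` would perform the call. The same header pattern applies to any other service that accepts the token.
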
  11. 11. In particular, you can access Globus services Globus Transfer • Uniform access to distributed storage (Posix, S3, Ceph, HPSS, Google Drive, Hadoop, Lustre, …) • HTTPS; GridFTP for high-speed, reliable, third-party transfers • Shared endpoints: User-managed access control • Web, REST, CLI access • HIPAA compliant 12,000 active Globus Connect endpoints (including most universities and labs)
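
To make "third-party transfers" over REST more concrete, here is a hedged sketch of the JSON document a client would POST to the Transfer API's `/transfer` resource. The helper name and all endpoint IDs and paths are placeholders invented for the example. The submission ID is assumed to have been obtained from the service beforehand.

```python
def build_transfer_document(submission_id, source_endpoint,
                            destination_endpoint, items, label=None):
    """Assemble a transfer submission document as a plain dict.

    `items` is a list of (source_path, destination_path) pairs for files.
    All IDs and paths passed in are supplied by the caller; nothing here
    contacts the service.
    """
    document = {
        "DATA_TYPE": "transfer",
        "submission_id": submission_id,
        "source_endpoint": source_endpoint,
        "destination_endpoint": destination_endpoint,
        "DATA": [
            {
                "DATA_TYPE": "transfer_item",
                "source_path": src,
                "destination_path": dst,
                "recursive": False,  # single files only in this sketch
            }
            for src, dst in items
        ],
    }
    if label:
        document["label"] = label
    return document
```

Because the two storage systems exchange the data directly, the notebook only submits this small document and then polls the task status. The data itself never passes through the notebook server.
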
  12. 12. In particular, you can access Globus services Globus Search • Cloud-hosted, schema agnostic • Scale to billions of objects Globus Identifiers • Digital object ids for your data • DataCite or other metadata Globus Publication platform • Customized publication pipelines Canadian Federated Research Data Repository: https://frdr.ca/
  13. 13. Demonstration What we’re going to do: • Log in to our JupyterCon JupyterHub* • Launch (spawn) a Notebook Server • Get tokens • Access some Globus APIs • Download some data • Plot it • PUT the result on an HTTPS endpoint *Zero to JupyterHub: Fast JupyterHub on Kubernetes https://zero-to-jupyterhub.readthedocs.io
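
The last step of the demonstration, PUT the result on an HTTPS endpoint, could look roughly like the sketch below. The function name, the URL used in the usage note, and the choice of content type are assumptions for illustration. The request is constructed but not sent.

```python
from urllib.request import Request


def build_put_request(url, payload, access_token,
                      content_type="application/octet-stream"):
    """Build (but do not send) an authenticated HTTPS PUT request.

    `payload` is the raw bytes to upload, for example a rendered plot.
    """
    return Request(
        url,
        data=payload,
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": content_type,
        },
    )
```

For example, `build_put_request("https://example.org/results/plot.png", png_bytes, token, "image/png")` prepares an upload of a plot, where the URL is a made-up placeholder and `png_bytes` and `token` come from earlier steps.
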
  14. 14. https://jupyter.demo.globus.org/ Login to Start Tutorial
  15. 15. The story so far … • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource
  16. 16. But wait, there’s more! • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource • Create a containerized data science ecosystem that encompasses laptops, servers, clouds, HPC
  17. 17. Container Registry IAM container metadata container recipes ALCF Petrel containers Supercomputer compute compute compute compute JupyterHub Notebook Server Containers are staged to local file systems Users select the container to execute their custom Jupyter environment The same containers can be used for both Jupyter notebook server and compute nodes, for consistency Unified IAM platform scalable for distributed projects Container definitions are tracked in version control systems A registry for container discovery and referencing Containers can be used for other tasks: analysis; ML; etc. Containers everywhere
  18. 18. And more … • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource • Create a containerized data science ecosystem that encompasses laptops, servers, clouds, HPC • Incorporate seamless parallel computing via Parsl
  19. 19. Python parallel library • Tasks exposed as functions (Python or bash) • Python code to glue functions together • Globus for auth and data movement (Data) science applications require: • Interactivity • Scalability - Need more than a desktop • Reproducibility - Publish code and documentation Our solution: JupyterHub + Parsl • Interactive computing environment • Notebooks for publication • Can run on dedicated hardware parsl-project.org Interactive, scalable, reproducible data analysis @python_app def compute_features(chunk): for f in featurizers: chunk = f.featurize_dataframe(chunk, 'atoms') return chunk chunks = [compute_features(chunk) for chunk in np.array_split(data, chunks)]
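
The split-apply-gather pattern in the slide's snippet can be tried without a Parsl installation. The sketch below swaps in the standard library's `concurrent.futures` for Parsl, so it shows only the futures idea, not Parsl's own API. The `compute_features` body is a trivial stand-in for the featurizers on the slide, and `split` is a hypothetical helper replacing `np.array_split`.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, n):
    """Split a list into n contiguous, nearly equal chunks."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def compute_features(chunk):
    # Stand-in for the per-chunk featurization on the slide.
    return [x * x for x in chunk]


def featurize_parallel(data, n_chunks=4):
    """Submit one task per chunk, then gather results in chunk order."""
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        futures = [pool.submit(compute_features, c)
                   for c in split(data, n_chunks)]
        # Each future resolves to that chunk's output.
        return [value for f in futures for value in f.result()]
```

With Parsl, the decorated function itself returns a future when called, so the explicit `pool.submit` step disappears and the same code can be directed at a cluster or supercomputer through configuration.
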
  20. 20. @python_app
  21. 21. And more … • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource • Create a containerized data science ecosystem that encompasses laptops, servers, clouds, HPC • Incorporate seamless parallel computing via Parsl • Jupyter notebooks for rules-based automation – Notebooks are triggered by events (e.g., new data available) – Notebooks trigger events (e.g., computation completed)
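
One way to picture rules-based automation is a small event loop in which rules map event types to actions, and actions may emit follow-on events. Everything below is hypothetical: the `RuleEngine` class, the event names, and the stand-in for running a notebook are invented for illustration and do not describe the Globus automation service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RuleEngine:
    """Maps event types to actions; actions may return new events."""
    rules: Dict[str, List[Callable]] = field(default_factory=dict)

    def on(self, event_type, action):
        """Register an action to run whenever event_type occurs."""
        self.rules.setdefault(event_type, []).append(action)

    def emit(self, event):
        """Process an event and any events it triggers, in FIFO order.

        Returns the list of event types handled, in order.
        """
        queue, handled = [event], []
        while queue:
            current = queue.pop(0)
            handled.append(current["type"])
            for action in self.rules.get(current["type"], []):
                queue.extend(action(current) or [])
        return handled


def run_analysis_notebook(event):
    # Stand-in for executing a notebook on the newly arrived data.
    return [{"type": "computation_completed", "path": event["path"]}]
```

Registering `run_analysis_notebook` for a `new_data` event and a notification action for `computation_completed` reproduces the two bullets above: a notebook is triggered by an event, and its completion triggers another.
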
  22. 22. globus.org
  23. 23. And more … • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource • Create a containerized data science ecosystem that encompasses laptops, servers, clouds, HPC • Incorporate seamless parallel computing via Parsl • Jupyter notebooks for rules-based automation • Integration with JupyterLab (student summer project)
  24. 24. Juan David Garrido
  25. 25. Browse data on local storage
  26. 26. Search remote storage systems
  27. 27. Select files on remote storage
  28. 28. Transfer data to local storage
  29. 29. There it is!
  30. 30. Search remote databases
  31. 31. Select materials data
  32. 32. Inspect materials data
  33. 33. Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner. I reported on the work of many talented people, and others from the Globus team, the University of Chicago, and Argonne Nat Lab. We are grateful to our sponsors. DLHub, Globus, IMaD, Petrel, Argonne Leadership Computing Facility. At JupyterCon 2018
  34. 34. For more information “Globus APIs enable authentication, data access, data movement, data sharing, data search” → See https://docs.globus.org “Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource” → Tutorial: https://jupyter.demo.globus.org Blog: https://www.globus.org/blog/using-globus-jupyter-notebooks “But wait, there’s more!” → Talk to me, or check back in a few months foster@uchicago.edu