Ian Foster
foster@uchicago.edu
JupyterCon, New York, August 23, 2018
Scaling collaborative data science with Globus and Jupyter
Globus simplifies working with large, distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to access notebooks with existing institutional credentials, connect notebooks to data residing on disparate storage systems, and make data securely available to business partners and research collaborators.


Scaling collaborative data science with Globus and Jupyter

  1. Ian Foster (foster@uchicago.edu). JupyterCon, New York, August 23, 2018. Scaling collaborative data science with Globus and Jupyter
  2. Andre Schleife, UIUC: Modeling stopping power with time-dependent density functional theory. Hydrogen in Gold, v=2.0; 16,000 CPU-hours per simulation. Figure labels: Sample; Experimental scattering; Material composition (La 60%, Sr 40%); Simulated structure; Simulated scattering; Evolutionary optimization. 786,432 CPUs, 10 PFLOPS (10^16 flops) supercomputer, Argonne Leadership Computing Facility
  3. @python_app. Logan Ward
  4. But data are big and distributed, and our science is collaborative. (1) Query (2) Transfer (3) Learn (4) Share. Diagram labels: materialsdatafacility.org; petrel.alcf.anl.gov; Cooley: 290 TeraFLOPS; 2 PB, 80 Gbps Globus-enabled store; 3.2M materials data. We need multi-credential, multi-service authentication and big data management
  5. Globus services: operated by UChicago for researchers worldwide. Automate. globus.org
  6. Recall: JupyterHub components • Multi-user Hub • Configurable HTTP proxy • Multiple single-user Jupyter notebook servers. Diagram labels: Browser; Configurable HTTP proxy; /hub/; /user/[name]/; Hub; Authenticator; User database; Spawner; /api/auth; Notebook
  7. Recall: JupyterHub components • Multi-user Hub • Configurable HTTP proxy • Multiple single-user Jupyter notebook servers. We want to grant notebooks access to the world • Tokens for remote services • APIs for remote actions, e.g., Globus data management. Diagram labels: Browser; Configurable HTTP proxy; /hub/; /user/[name]/; Hub; Authenticator; User database; Spawner; /api/auth; Notebook; Cooley; Globus; Petrel
  8. Securing JupyterHub with Globus Auth. We provide a simple Globus OAuth plugin (JupyterHub OAuthenticator) • 100s of identity providers (can restrict which ones) • 1000 registered clients, apps • Custom scopes • Tokens passed into notebook environment
  9. Use within JupyterHub is easy: https://github.com/jupyterhub/oauthenticator#globus-setup
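The setup that slide 9 links to comes down to a handful of settings in `jupyterhub_config.py`. The following is a minimal sketch, assuming the `GlobusOAuthenticator` class from the `oauthenticator` package and its usual OAuth options. The helper name `configure_globus_auth`, the environment variable names, and the callback host are hypothetical, and option names may differ between oauthenticator versions.

```python
import os
from types import SimpleNamespace


def configure_globus_auth(c, callback_url):
    """Point JupyterHub at Globus Auth (sketch; check your oauthenticator version)."""
    # Dotted class path of the Globus authenticator shipped with oauthenticator.
    c.JupyterHub.authenticator_class = "oauthenticator.globus.GlobusOAuthenticator"
    # Callback URL registered for the app at Globus; must match the hub's address.
    c.GlobusOAuthenticator.oauth_callback_url = callback_url
    # Hypothetical environment variable names; keeps secrets out of the config file.
    c.GlobusOAuthenticator.client_id = os.environ.get("GLOBUS_CLIENT_ID", "")
    c.GlobusOAuthenticator.client_secret = os.environ.get("GLOBUS_CLIENT_SECRET", "")
    # Persist tokens so they can be handed to the user's notebook server.
    c.GlobusOAuthenticator.enable_auth_state = True
    return c


# Stand-in for the `c` object JupyterHub provides, so the sketch runs on its own.
demo = SimpleNamespace(JupyterHub=SimpleNamespace(), GlobusOAuthenticator=SimpleNamespace())
configure_globus_auth(demo, "https://hub.example.org/hub/oauth_callback")
print(demo.JupyterHub.authenticator_class)
```

In a real `jupyterhub_config.py` the function would be called with the `c` object that JupyterHub supplies instead of the stand-in.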
  10. Tokens are easily used within notebooks. Diagram labels: Login; REST APIs; {“tokens”: …}; Bearer a45cd…; Globus Transfer; Globus Search; Globus Publish; Your App; Another App; Browser; Configurable HTTP proxy; /hub/; /user/[name]/; Hub; Authenticator; User database; Spawner; /api/auth; Notebook
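How the tokens reach the notebook depends on how the hub is configured. The sketch below assumes the hub places them in an environment variable named `GLOBUS_DATA` as a base64-encoded pickled dictionary with a `tokens` entry keyed by resource server. The variable name, the encoding, and the dictionary shape are all assumptions to check against your deployment, and the sample token is fake.

```python
import base64
import os
import pickle


def load_tokens(env_var="GLOBUS_DATA"):
    """Decode the token bundle the hub placed in the notebook environment."""
    raw = os.environ[env_var]
    # Assumed encoding: base64 over a pickled dict. Only unpickle data you trust.
    return pickle.loads(base64.b64decode(raw))["tokens"]


def bearer_header(tokens, resource_server):
    """Build the Authorization header for one service's REST API."""
    return {"Authorization": "Bearer " + tokens[resource_server]["access_token"]}


# Fake bundle so the sketch runs outside a hub; real tokens come from Globus Auth.
sample = {"tokens": {"transfer.api.globus.org": {"access_token": "a45cd"}}}
os.environ["GLOBUS_DATA"] = base64.b64encode(pickle.dumps(sample)).decode("ascii")

tokens = load_tokens()
print(bearer_header(tokens, "transfer.api.globus.org"))
```

Each service gets its own token, so the header is built per resource server rather than once for the whole session.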
  11. In particular, you can access Globus services. Globus Transfer • Uniform access to distributed storage (Posix, S3, Ceph, HPSS, Google Drive, Hadoop, Lustre, …) • HTTPS; GridFTP for high-speed, reliable, third-party transfers • Shared endpoints: user-managed access control • Web, REST, CLI access • HIPAA compliant. 12,000 active Globus Connect endpoints (including most universities and labs)
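A third-party transfer is requested by sending the service a document that names the two endpoints and the paths to copy. The data then moves directly between the storage systems. The sketch below only assembles that document. The field names follow my reading of the Globus Transfer REST API and should be checked against https://docs.globus.org, and the endpoint IDs, submission ID, and paths are placeholders.

```python
import json


def transfer_document(submission_id, source_endpoint, dest_endpoint, items, label=""):
    """Assemble the JSON body for a third-party transfer request."""
    return {
        "DATA_TYPE": "transfer",
        "submission_id": submission_id,  # obtained from the service beforehand
        "source_endpoint": source_endpoint,
        "destination_endpoint": dest_endpoint,
        "label": label,
        "DATA": [
            {
                "DATA_TYPE": "transfer_item",
                "source_path": src,
                "destination_path": dst,
                "recursive": False,  # single files only in this sketch
            }
            for src, dst in items
        ],
    }


doc = transfer_document(
    "placeholder-submission-id",
    "placeholder-source-endpoint-id",
    "placeholder-destination-endpoint-id",
    [("/remote/data/sample.csv", "/home/user/sample.csv")],
    label="notebook demo",
)
print(json.dumps(doc, indent=2))
```

In practice the `globus_sdk` Python package wraps this request, so notebooks rarely need to build the document by hand.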
  12. In particular, you can access Globus services. Globus Search • Cloud-hosted, schema agnostic • Scale to billions of objects. Globus Identifiers • Digital object ids for your data • DataCite or other metadata. Globus Publication platform • Customized publication pipelines. Canadian Federated Research Data Repository: https://frdr.ca/
  13. Demonstration. What we’re going to do: • Log in to our JupyterCon JupyterHub* • Launch (spawn) a Notebook Server • Get tokens • Access some Globus APIs • Download some data • Plot it • PUT the result on an HTTPS endpoint. *Zero to JupyterHub: Fast JupyterHub on Kubernetes https://zero-to-jupyterhub.readthedocs.io
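The last demo step, PUT the result on an HTTPS endpoint, can be sketched with the standard library alone. The URL, token value, and payload below are placeholders, and the request is built but not sent so that the example needs no network access.

```python
import urllib.request


def build_put(url, payload, access_token, content_type="image/png"):
    """Prepare an authenticated HTTPS PUT that uploads `payload` to `url`."""
    req = urllib.request.Request(url, data=payload, method="PUT")
    req.add_header("Authorization", "Bearer " + access_token)
    req.add_header("Content-Type", content_type)
    return req


# Placeholder bytes standing in for a saved plot.
req = build_put("https://endpoint.example.org/results/plot.png", b"fake-png-bytes", "a45cd")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send the request.
```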
  14. https://jupyter.demo.globus.org/ Login to Start Tutorial
  15. The story so far … • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource
  16. But wait, there’s more! • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource • Create a containerized data science ecosystem that encompasses laptops, servers, clouds, HPC
  17. Containers everywhere. Diagram labels: Container Registry; IAM; container metadata; container recipes; ALCF Petrel; containers; Supercomputer; compute (×4); JupyterHub; Notebook Server. • Containers are staged to local file systems • Users select the container to execute their custom Jupyter environment • The same containers can be used for both Jupyter notebook server and compute nodes, for consistency • Unified IAM platform scalable for distributed projects • Container definitions are tracked in version control systems • A registry for container discovery and referencing • Containers can be used for other tasks: analysis; ML; etc.
  18. And more … • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource • Create a containerized data science ecosystem that encompasses laptops, servers, clouds, HPC • Incorporate seamless parallel computing via Parsl
  19. Interactive, scalable, reproducible data analysis (parsl-project.org). Python parallel library • Tasks exposed as functions (Python or bash) • Python code to glue functions together • Globus for auth and data movement. (Data) science applications require: • Interactivity • Scalability: need more than a desktop • Reproducibility: publish code and documentation. Our solution: JupyterHub + Parsl • Interactive computing environment • Notebooks for publication • Can run on dedicated hardware. Code shown on the slide:

        @python_app
        def compute_features(chunk):
            for f in featurizers:
                chunk = f.featurize_dataframe(chunk, 'atoms')
            return chunk

        chunks = [compute_features(chunk)
                  for chunk in np.array_split(data, chunks)]
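The pattern on slide 19 is that calling a decorated function returns a future at once, and the results are collected later. The sketch below imitates that pattern with the standard library's `concurrent.futures`. It is a stand-in and not Parsl itself. The decorator name `futures_app`, the helper `split`, and the squaring "featurizer" are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import wraps

_pool = ThreadPoolExecutor(max_workers=4)


def futures_app(func):
    """Stand-in for an app decorator: calling the function returns a Future."""
    @wraps(func)
    def submit(*args, **kwargs):
        return _pool.submit(func, *args, **kwargs)
    return submit


@futures_app
def compute_features(chunk):
    # Placeholder featurizer: square each value in the chunk.
    return [x * x for x in chunk]


def split(data, n_chunks):
    """Split a list into n_chunks nearly equal consecutive pieces."""
    size, extra = divmod(len(data), n_chunks)
    pieces, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        pieces.append(data[start:end])
        start = end
    return pieces


data = list(range(10))
futures = [compute_features(chunk) for chunk in split(data, 3)]  # returns immediately
results = [f.result() for f in futures]  # blocks until each task has finished
print(results)
```

With Parsl, the same calling code can be pointed at larger resources by changing the loaded configuration, which is the scalability argument the slide makes.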
  20. @python_app
  21. And more … • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource • Create a containerized data science ecosystem that encompasses laptops, servers, clouds, HPC • Incorporate seamless parallel computing via Parsl • Jupyter notebooks for rules-based automation: notebooks are triggered by events (e.g., new data available), and notebooks trigger events (e.g., computation completed)
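The rules-based automation idea on slide 21 is that events trigger work, and that work can emit further events. It can be illustrated with a small in-process dispatcher. This is only a sketch of the pattern and not a Globus API. The class `RuleEngine`, the event names, and the file path are hypothetical.

```python
from collections import deque


class RuleEngine:
    """Map event names to handlers; handlers may emit follow-up events."""

    def __init__(self):
        self.rules = {}  # event name -> list of handler functions
        self.log = []    # event names in the order they were processed

    def on(self, event_name):
        def register(handler):
            self.rules.setdefault(event_name, []).append(handler)
            return handler
        return register

    def run(self, first_event, payload):
        queue = deque([(first_event, payload)])
        while queue:
            name, data = queue.popleft()
            self.log.append(name)
            for handler in self.rules.get(name, []):
                # A handler returns a follow-up (event, payload) pair or None.
                follow_up = handler(data)
                if follow_up is not None:
                    queue.append(follow_up)
        return self.log


engine = RuleEngine()


@engine.on("new_data_available")
def analyze(path):
    # In a real deployment this step would execute a notebook on the new file.
    return ("computation_completed", {"input": path, "status": "ok"})


@engine.on("computation_completed")
def notify(result):
    print("finished:", result["input"])


print(engine.run("new_data_available", "/data/run42.h5"))
```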
  22. globus.org
  23. And more … • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource • Create a containerized data science ecosystem that encompasses laptops, servers, clouds, HPC • Incorporate seamless parallel computing via Parsl • Jupyter notebooks for rules-based automation • Integration with JupyterLab (student summer project)
  24. Juan David Garrido
  25. Browse data on local storage
  26. Search remote storage systems
  27. Select files on remote storage
  28. Transfer data to local storage
  29. There it is!
  30. Search remote databases
  31. Select materials data
  32. Inspect materials data
  33. I reported on the work of many talented people: Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner, and others from the Globus team, the University of Chicago, and Argonne National Laboratory. We are grateful to our sponsors. DLHub, Globus, IMaD, Petrel, Argonne Leadership Computing Facility. At JupyterCon 2018
  34. For more information • “Globus APIs enable authentication, data access, data movement, data sharing, data search”: see https://docs.globus.org • “Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource”: Tutorial: https://jupyter.demo.globus.org Blog: https://www.globus.org/blog/using-globus-jupyter-notebooks • “But wait, there’s more!”: talk to me, or check back in a few months. foster@uchicago.edu
