Ian Foster
foster@uchicago.edu
JupyterCon, New York, August 23, 2018
Scaling collaborative
data science
with Globus and Jupyter
Andre Schleife
UIUC
Modeling stopping power
with time-dependent
density functional theory
Hydrogen in Gold, v=2.0
16,000 CPU-hours per simulation
Sample
Experimental scattering
Material composition
Simulated structure
Simulated scattering
La 60%
Sr 40%
Evolutionary optimization
786,432 CPUs, 10 PFLOPS (10^16 flops) supercomputer
Argonne Leadership Computing Facility
Logan Ward
But data are big and distributed, and
our science is collaborative
(1) Query
(2) Transfer
(3) Learn
materialsdatafacility.org
petrel.alcf.anl.gov
Cooley: 290 TeraFLOPS
(4) Share
2 PB, 80 Gbps Globus-enabled store
3.2M materials data
We need multi-credential,
multi-service authentication
and big data management
operated by UChicago for researchers worldwide
Automate
globus.org
Globus services
• Multi-user Hub
• Configurable HTTP proxy
• Multiple single-user Jupyter
notebook servers
Recall: JupyterHub components
Hub
Configurable HTTP proxy
Authenticator
User database
Spawner
Notebook
/api/auth
Browser
/hub/
/user/[name]/
• Multi-user Hub
• Configurable HTTP proxy
• Multiple single-user Jupyter
notebook servers
Recall: JupyterHub components
We want to grant notebooks
access to the world
• Tokens for remote services
• APIs for remote actions: e.g.,
Globus data management Hub
Configurable HTTP proxy
Authenticator
User database
Spawner
Notebook
/api/auth
Browser
/hub/
/user/[name]/
Cooley Globus
Petrel
Securing JupyterHub with Globus Auth
We provide a simple
Globus OAuth plugin
• 100s of identity providers
(can restrict which ones)
• 1000 registered clients and apps
• Custom scopes
• Tokens passed into notebook
environment
JupyterHub OAuthenticator
Use within JupyterHub is easy
https://github.com/jupyterhub/oauthenticator#globus-setup
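As a rough sketch of what that setup looks like, a `jupyterhub_config.py` along these lines wires in the Globus authenticator (the callback URL, client ID, and secret below are placeholders; the exact option names supported depend on your oauthenticator version, so follow the linked setup guide):

```python
# jupyterhub_config.py -- sketch only; register your own client at
# developers.globus.org and substitute real values for the placeholders
from oauthenticator.globus import GlobusOAuthenticator

c.JupyterHub.authenticator_class = GlobusOAuthenticator
c.GlobusOAuthenticator.oauth_callback_url = 'https://myhub.example.org/hub/oauth_callback'
c.GlobusOAuthenticator.client_id = 'YOUR-CLIENT-ID'
c.GlobusOAuthenticator.client_secret = 'YOUR-CLIENT-SECRET'
```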
Tokens are easily used within notebooks
Login
REST APIs
{“tokens”:…
{“tokens”:…
REST APIs
REST APIs
Bearer a45cd…
Globus Transfer
Globus Search
Globus Publish
Your App
Another App
Hub
Configurable HTTP proxy
Authenticator
User database
Notebook
/hub/
/user/[name]/
Spawner
/api/auth
Browser
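Inside a notebook, the token flow above boils down to parsing the token JSON the hub hands over and attaching an access token as a bearer header on each REST call. A minimal sketch (the token value mirrors the `Bearer a45cd…` header above; the exact JSON shape and how the hub delivers it depend on your configuration):

```python
import json

# Illustrative token payload as the hub might pass it to the notebook;
# the resource-server keys and nesting are assumptions for this sketch
raw = '{"tokens": {"transfer.api.globus.org": {"access_token": "a45cd"}}}'

tokens = json.loads(raw)["tokens"]
access_token = tokens["transfer.api.globus.org"]["access_token"]

# Every Globus REST call then carries the token as a bearer header
headers = {"Authorization": "Bearer " + access_token}
```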
In particular, you can access Globus services
Globus Transfer
• Uniform access to
distributed storage (POSIX,
S3, Ceph, HPSS, Google
Drive, Hadoop, Lustre, …)
• HTTPS; GridFTP for high-
speed, reliable, third-party
transfers
• Shared endpoints: User-
managed access control
• Web, REST, CLI access
• HIPAA compliant
12,000 active Globus Connect endpoints
(including most universities and labs)
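To make the third-party transfer idea concrete, here is a sketch of the JSON task document a Transfer submission carries; the `DATA_TYPE`/`DATA` structure follows the public Transfer API, but the endpoint names and paths below are made up for illustration:

```python
def make_transfer_doc(submission_id, source_endpoint, destination_endpoint, items):
    """Build a Globus Transfer task document (shape per the public REST API)."""
    return {
        "DATA_TYPE": "transfer",
        "submission_id": submission_id,
        "source_endpoint": source_endpoint,
        "destination_endpoint": destination_endpoint,
        "DATA": [
            {"DATA_TYPE": "transfer_item",
             "source_path": src,
             "destination_path": dst}
            for src, dst in items
        ],
    }

# Hypothetical endpoints and paths, for illustration only
doc = make_transfer_doc("0123-sub", "petrel-endpoint", "laptop-endpoint",
                        [("/materials/run42.h5", "/~/run42.h5")])
```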
In particular, you can access Globus services
Globus Search
• Cloud-hosted, schema agnostic
• Scale to billions of objects
Globus Identifiers
• Digital object ids for your data
• DataCite or other metadata
Globus Publication platform
• Customized publication pipelines
Canadian Federated Research Data Repository: https://frdr.ca/
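Because Search is schema agnostic, queries are structured documents rather than fixed fields. A sketch of the kind of query document the Search API accepts (the `q`/`limit`/`filters` layout follows the API; the field name and values here are invented for illustration):

```python
# Illustrative Globus Search query document; "material.elements" is a
# hypothetical field name, not one from a real index
query = {
    "q": "perovskite",
    "limit": 10,
    "filters": [
        {"type": "match_all",
         "field_name": "material.elements",
         "values": ["La", "Sr"]},
    ],
}
```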
Demonstration
What we’re going to do:
• Log in to our JupyterCon JupyterHub*
• Launch (spawn) a Notebook Server
• Get tokens
• Access some Globus APIs
• Download some data
• Plot it
• PUT the result on an HTTPS endpoint
*Zero to JupyterHub: Fast JupyterHub on Kubernetes
https://zero-to-jupyterhub.readthedocs.io
https://jupyter.demo.globus.org/
Login to Start Tutorial
The story so far …
• Globus APIs enable authentication, data access, data
movement, data sharing, data search
• Can be used in notebooks and in JupyterHub/Lab to
access any data anywhere and to secure any resource
But wait, there’s more!
• Globus APIs enable authentication, data access, data
movement, data sharing, data search
• Can be used in notebooks and in JupyterHub/Lab to
access any data anywhere and to secure any resource
• Create a containerized data science ecosystem that
encompasses laptops, servers, clouds, HPC
Container
Registry
IAM
container metadata
container recipes
ALCF Petrel
containers
Supercomputer
compute
compute
compute
compute
JupyterHub
Notebook Server
Containers are staged
to local file systems
Users select the container
to execute their custom
Jupyter environment
The same containers can be used for
both Jupyter notebook server and
compute nodes, for consistency
Unified IAM platform
scalable for
distributed projects
Container definitions
are tracked in version
control systems
A registry for container
discovery and referencing
Containers can be used for
other tasks: analysis; ML; etc.
Containers
everywhere
And more …
• Globus APIs enable authentication, data access, data
movement, data sharing, data search
• Can be used in notebooks and in JupyterHub/Lab to
access any data anywhere and to secure any resource
• Create a containerized data science ecosystem that
encompasses laptops, servers, clouds, HPC
• Incorporate seamless parallel computing via Parsl
Python parallel library
• Tasks exposed as
functions (Python or bash)
• Python code to glue functions together
• Globus for auth and data movement
(Data) science applications require:
• Interactivity
• Scalability
- Need more than a desktop
• Reproducibility
- Publish code and documentation
Our solution: JupyterHub + Parsl
 Interactive computing environment
 Notebooks for publication
 Can run on dedicated hardware
parsl-project.org
Interactive, scalable, reproducible data analysis
@python_app
def compute_features(chunk):
    for f in featurizers:
        chunk = f.featurize_dataframe(chunk, 'atoms')
    return chunk

futures = [compute_features(chunk)
           for chunk in np.array_split(data, n_chunks)]
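To see the `@python_app` pattern end to end without a Parsl installation, here is a stand-in sketch built on the standard library's thread pool: a decorator with the same shape as Parsl's, so decorated calls return futures that are resolved with `.result()` (the featurization body is a placeholder, not Parsl's API):

```python
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)

def python_app(fn):
    """Stand-in for parsl's python_app: calling fn submits it asynchronously."""
    def wrapper(*args, **kwargs):
        return _executor.submit(fn, *args, **kwargs)
    return wrapper

@python_app
def compute_features(chunk):
    # Placeholder featurization: square each value in the chunk
    return [x * x for x in chunk]

futures = [compute_features(c) for c in ([1, 2], [3, 4])]
results = [f.result() for f in futures]  # → [[1, 4], [9, 16]]
```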
And more …
• Globus APIs enable authentication, data access, data
movement, data sharing, data search
• Can be used in notebooks and in JupyterHub/Lab to
access any data anywhere and to secure any resource
• Create a containerized data science ecosystem that
encompasses laptops, servers, clouds, HPC
• Incorporate seamless parallel computing via Parsl
• Jupyter notebooks for rules-based automation
– Notebooks are triggered by events (e.g., new data available)
– Notebooks trigger events (e.g., computation completed)
globus.org
And more …
• Globus APIs enable authentication, data access, data
movement, data sharing, data search
• Can be used in notebooks and in JupyterHub/Lab to
access any data anywhere and to secure any resource
• Create a containerized data science ecosystem that
encompasses laptops, servers, clouds, HPC
• Incorporate seamless parallel computing via Parsl
• Jupyter notebooks for rules-based automation
• Integration with JupyterLab (student summer project)
Juan David Garrido
Browse data on
local storage
Search remote
storage systems
Select files on
remote storage
Transfer data to
local storage
There it is!
Search remote
databases
Select materials
data
Inspect
materials data
Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner
I reported on the work of many talented people
And others from the Globus team, the University of Chicago, and Argonne National Laboratory
We are grateful to our sponsors
DLHub Globus
IMaD
Petrel
Argonne Leadership
Computing Facility
At JupyterCon 2018
For more information
“Globus APIs enable authentication, data access, data
movement, data sharing, data search”
 See https://docs.globus.org
“Can be used in notebooks and in JupyterHub/Lab to
access any data anywhere and to secure any resource”
 Tutorial: https://jupyter.demo.globus.org
Blog: https://www.globus.org/blog/using-globus-jupyter-notebooks
“But wait, there’s more!”
 Talk to me, or check back in a few months
foster@uchicago.edu