Scaling collaborative data science with Globus and Jupyter

Ian Foster, Director, Data Science and Learning Division; Professor of Computer Science
foster@uchicago.edu
JupyterCon, New York, August 23, 2018
Andre Schleife
UIUC
Modeling stopping power
with time-dependent
density functional theory
Hydrogen in Gold, v=2.0
16,000 CPU-hours per simulation
Sample
Experimental scattering
Material composition
Simulated structure
Simulated scattering
La 60%
Sr 40%
Evolutionary optimization
786,432 CPUs, 10 PFLOPS (10^16 flops) supercomputer
Argonne Leadership Computing Facility
@python_app
Logan Ward
But data are big and distributed, and
our science is collaborative
(1) Query
(2) Transfer
(3) Learn
materialsdatafacility.org
petrel.alcf.anl.gov
Cooley: 290 TeraFLOPS
(4) Share
2 PB, 80 Gbps Globus-enabled store
3.2M materials data
We need multi-credential,
multi-service authentication
and big data management
operated by UChicago for researchers worldwide
Automate
globus.org
Globus services
• Multi-user Hub
• Configurable HTTP proxy
• Multiple single-user Jupyter
notebook servers
Recall: JupyterHub components
Hub
Configurable HTTP proxy
Authenticator
User database
Spawner
Notebook
/api/auth
Browser
/hub/
/user/[name]/
We want to grant notebooks
access to the world
• Tokens for remote services
• APIs for remote actions: e.g., Globus data management
Hub
Configurable HTTP proxy
Authenticator
User database
Spawner
Notebook
/api/auth
Browser
/hub/
/user/[name]/
Cooley Globus
Petrel
Securing JupyterHub with Globus Auth
We provide a simple
Globus OAuth plugin
• 100s of identity providers
(can restrict which ones)
• 1000 registered clients, apps
• Custom scopes
• Tokens passed into notebook
environment
JupyterHub OAuthenticator
Use within JupyterHub is easy
https://github.com/jupyterhub/oauthenticator#globus-setup
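A hub configuration along these lines might look like the sketch below. It assumes the `oauthenticator` package is installed and that the hub has been registered as a Globus Auth client; the function name `configure_globus_auth` and the environment variable names are illustrative choices, not taken from the talk, so check the setup guide linked above for the options your version supports.

```python
import os


def configure_globus_auth(c, env=os.environ):
    """Apply Globus OAuth settings to a JupyterHub config object `c`."""
    # Select the Globus authenticator shipped with the oauthenticator package.
    c.JupyterHub.authenticator_class = "oauthenticator.globus.GlobusOAuthenticator"
    # Credentials obtained when registering the hub as a Globus Auth client
    # (variable names here are assumptions for this sketch).
    c.GlobusOAuthenticator.oauth_callback_url = env["OAUTH_CALLBACK_URL"]
    c.GlobusOAuthenticator.client_id = env["OAUTH_CLIENT_ID"]
    c.GlobusOAuthenticator.client_secret = env["OAUTH_CLIENT_SECRET"]
    # Keep the auth state so tokens can be passed on to the notebook server.
    c.GlobusOAuthenticator.enable_auth_state = True
    return c


# In jupyterhub_config.py one would call: configure_globus_auth(get_config())
```

Keeping the settings in a function makes them easy to check outside a running hub.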
Tokens are easily used within notebooks
Login
REST APIs
{“tokens”:…
{“tokens”:…
REST APIs
REST APIs
Bearer a45cd…
Globus Transfer
Globus Search
Globus Publish
Your App
Another App
Hub
Configurable HTTP proxy
Authenticator
User database
Notebook
/hub/
/user/[name]/
Spawner
/api/auth
Browser
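Inside a notebook, using the tokens could look roughly like the sketch below. It assumes the hub passes a base64-encoded, pickled bundle in an environment variable named `GLOBUS_DATA`, with tokens keyed by resource server; that layout, the variable name, and the helper names are assumptions for illustration and may differ in your deployment.

```python
import base64
import os
import pickle


def load_globus_tokens(env=os.environ, var="GLOBUS_DATA"):
    """Decode the token bundle the hub placed in the notebook environment."""
    # Assumption: a base64-encoded pickle holding a dict with a "tokens" entry.
    # Only unpickle data that comes from a hub you trust.
    bundle = pickle.loads(base64.b64decode(env[var]))
    return bundle["tokens"]


def bearer_header(tokens, resource_server):
    """Build the Authorization header for one remote service."""
    access_token = tokens[resource_server]["access_token"]
    return {"Authorization": "Bearer " + access_token}
```

The resulting header is what accompanies each REST call, as in the `Bearer a45cd…` label on the slide.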
In particular, you can access Globus services
Globus Transfer
• Uniform access to distributed storage (POSIX, S3, Ceph, HPSS, Google Drive, Hadoop, Lustre, …)
• HTTPS; GridFTP for high-speed, reliable, third-party transfers
• Shared endpoints: User-managed access control
• Web, REST, CLI access
• HIPAA compliant
12,000 active Globus Connect endpoints (including most universities and labs)
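A third-party transfer is requested by submitting a task document to the Transfer REST API. The sketch below only builds such a document; the field names follow the Transfer API task format as best understood here, and the endpoint IDs, paths, and the helper name `transfer_document` are hypothetical placeholders to verify against the Globus documentation.

```python
def transfer_document(submission_id, source_endpoint, dest_endpoint, items, label=""):
    """Build a transfer task document for a list of (source, destination) paths."""
    return {
        "DATA_TYPE": "transfer",
        "submission_id": submission_id,
        "source_endpoint": source_endpoint,
        "destination_endpoint": dest_endpoint,
        "label": label,
        # One transfer_item per file; the service moves data directly
        # between the two endpoints rather than through the notebook.
        "DATA": [
            {
                "DATA_TYPE": "transfer_item",
                "source_path": src,
                "destination_path": dst,
                "recursive": False,
            }
            for src, dst in items
        ],
    }


doc = transfer_document(
    "hypothetical-submission-id",
    "hypothetical-source-endpoint-id",
    "hypothetical-destination-endpoint-id",
    [("/data/sample.csv", "/home/user/sample.csv")],
    label="notebook transfer",
)
```

The document would then be sent with the bearer token for the transfer service.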
In particular, you can access Globus services
Globus Search
• Cloud-hosted, schema agnostic
• Scale to billions of objects
Globus Identifiers
• Digital object ids for your data
• DataCite or other metadata
Globus Publication platform
• Customized publication pipelines
Canadian Federated Research Data Repository: https://frdr.ca/
Demonstration
What we’re going to do:
• Log in to our JupyterCon JupyterHub*
• Launch (spawn) a Notebook Server
• Get tokens
• Access some Globus APIs
• Download some data
• Plot it
• PUT the result on an HTTPS endpoint
*Zero to JupyterHub: Fast JupyterHub on Kubernetes
https://zero-to-jupyterhub.readthedocs.io
https://jupyter.demo.globus.org/
Login to Start Tutorial
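The final demo step, putting the result on an HTTPS endpoint, can be sketched with the standard library alone. The URL, token, and helper name `build_put` below are placeholders rather than the tutorial's actual values, and the request is only constructed here, not sent.

```python
import urllib.request


def build_put(url, payload, access_token, content_type="image/png"):
    """Prepare an authenticated HTTPS PUT request carrying `payload` bytes."""
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={
            "Authorization": "Bearer " + access_token,
            "Content-Type": content_type,
        },
    )


request = build_put(
    "https://example.org/shared/plot.png",  # hypothetical HTTPS endpoint
    b"fake image bytes",
    "hypothetical-access-token",
)
# Sending it would be: urllib.request.urlopen(request)
```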
The story so far …
• Globus APIs enable authentication, data access, data
movement, data sharing, data search
• Can be used in notebooks and in JupyterHub/Lab to
access any data anywhere and to secure any resource
But wait, there’s more!
• Create a containerized data science ecosystem that
encompasses laptops, servers, clouds, HPC
Container
Registry
IAM
container metadata
container recipes
ALCF Petrel
containers
Supercomputer
compute
compute
compute
compute
JupyterHub
Notebook Server
Containers are staged
to local file systems
Users select the container
to execute their custom
Jupyter environment
The same containers can be used for
both Jupyter notebook server and
compute nodes, for consistency
Unified IAM platform
scalable for
distributed projects
Container definitions
are tracked in version
control systems
A registry for container
discovery and referencing
Containers can be used for
other tasks: analysis; ML; etc.
Containers
everywhere
And more …
• Incorporate seamless parallel computing via Parsl
Python parallel library
• Tasks exposed as
functions (Python or bash)
• Python code to glue functions together
• Globus for auth and data movement
(Data) science applications require:
• Interactivity
• Scalability
- Need more than a desktop
• Reproducibility
- Publish code and documentation
Our solution: JupyterHub + Parsl
 Interactive computing environment
 Notebooks for publication
 Can run on dedicated hardware
parsl-project.org
Interactive, scalable, reproducible data analysis
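import numpy as np
from parsl import python_app

@python_app
def compute_features(chunk):
    # Apply each featurizer in turn to this chunk of the dataframe
    for f in featurizers:
        chunk = f.featurize_dataframe(chunk, 'atoms')
    return chunk

# Each call returns a future, so the chunks can be processed in parallel
chunks = [compute_features(chunk)
          for chunk in np.array_split(data, chunks)]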
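The pattern on this slide (decorate a function, call it once per chunk, collect futures) can be tried without a Parsl installation. The sketch below is a stand-in built on the standard library's `concurrent.futures`, not Parsl itself; `python_app_standin`, `split`, and the squaring "featurizer" are illustrative names and placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)


def python_app_standin(func):
    """Stand-in for a Parsl app decorator: calls return futures immediately."""
    def submit(*args, **kwargs):
        return _executor.submit(func, *args, **kwargs)
    return submit


@python_app_standin
def compute_features(chunk):
    # Placeholder "featurizer": square each value in the chunk.
    return [x * x for x in chunk]


def split(data, n_chunks):
    """Split a list into n_chunks nearly equal contiguous pieces."""
    size, extra = divmod(len(data), n_chunks)
    pieces, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        pieces.append(data[start:end])
        start = end
    return pieces


data = list(range(10))
futures = [compute_features(chunk) for chunk in split(data, 3)]
# Block on each future and stitch the chunk results back together in order.
results = [value for f in futures for value in f.result()]
```

With Parsl, the same code shape can target clusters or supercomputers by changing the loaded configuration rather than the analysis code.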
And more …
• Jupyter notebooks for rules-based automation
– Notebooks are triggered by events (e.g., new data available)
– Notebooks trigger events (e.g., computation completed)
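One way to picture rules-based automation is a small dispatcher that matches incoming events against rules and collects the events that the triggered actions emit. This is a hypothetical sketch of the idea, not the Globus automation service; the event types, paths, and function names are invented for illustration.

```python
def make_rule(event_type, action):
    """A rule pairs an event type with the action to run when it occurs."""
    return {"event_type": event_type, "action": action}


def dispatch(event, rules):
    """Run every rule matching the event; return the events the actions emit."""
    emitted = []
    for rule in rules:
        if rule["event_type"] == event["type"]:
            emitted.extend(rule["action"](event))
    return emitted


def analyze_new_data(event):
    # Placeholder for executing a notebook against the newly arrived file.
    return [{"type": "computation_completed", "input": event["path"]}]


rules = [make_rule("new_data", analyze_new_data)]
out = dispatch({"type": "new_data", "path": "/data/run42.h5"}, rules)
```

Emitted events can be fed back into `dispatch`, which is how one notebook's completion could trigger the next step.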
globus.org
And more …
• Integration with JupyterLab (student summer project)
Juan David Garrido
Browse data on
local storage
Search remote
storage systems
Select files on
remote storage
Transfer data to
local storage
There it is!
Scaling collaborative data science with Globus and Jupyter
Search remote
databases
Select materials
data
Inspect
materials data
Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner
I reported on the work of many talented people
And others from the Globus team, the University of Chicago, and Argonne National Laboratory
We are grateful to our sponsors
DLHub Globus
IMaD
Petrel
Argonne Leadership
Computing Facility
At JupyterCon 2018
For more information
“Globus APIs enable authentication, data access, data
movement, data sharing, data search”
 See https://docs.globus.org
“Can be used in notebooks and in JupyterHub/Lab to
access any data anywhere and to secure any resource”
 Tutorial: https://jupyter.demo.globus.org
Blog: https://www.globus.org/blog/using-globus-jupyter-notebooks
“But wait, there’s more!”
 Talk to me, or check back in a few months
foster@uchicago.edu
Scaling collaborative data science with Globus and Jupyter

  • 1. Ian Foster foster@uchicago.edu JupyterCon, New York, August 23, 2018 Scaling collaborative data science with Globus and Jupyter
  • 2. Andre Schleife UIUC Modeling stopping power with time-dependent density functional theory Hydrogen in Gold, v=2.0 16,000 CPU-hours per simulation Sample Experimental scattering Material composition Simulated structure Simulated scattering La 60% Sr 40% Evolutionary optimization 786,432 CPUs, 10 PFLOPS (10^16 flops) supercomputer Argonne Leadership Computing Facility
  • 4. But data are big and distributed, and our science is collaborative (1) Query (2) Transfer (3) Learn materialsdatafacility.org petrel.alcf.anl.gov Cooley: 290 TeraFLOPS (4) Share 2 PB, 80 Gbps Globus-enabled store 3.2M materials data We need multi-credential, multi-service authentication and big data management
  • 5. operated by UChicago for researchers worldwide Automate globus.org Globus services
  • 6. • Multi-user Hub • Configurable HTTP proxy • Multiple single-user Jupyter notebook servers Recall: JupyterHub components Hub Configurable HTTP proxy Authenticator User database Spawner Notebook /api/auth Browser /hub/ /user/[name]/
  • 7. • Multi-user Hub • Configurable HTTP proxy • Multiple single-user Jupyter notebook servers Recall: JupyterHub components We want to grant notebooks access to the world • Tokens for remote services • APIs for remote actions: e.g., Globus data management Hub Configurable HTTP proxy Authenticator User database Spawner Notebook /api/auth Browser /hub/ /user/[name]/ Cooley Globus Petrel
  • 8. Securing JupyterHub with Globus Auth We provide a simple Globus OAuth plugin • 100s of identity providers (can restrict which ones) • 1000 registered clients, apps • Custom scopes • Tokens passed into notebook environment JupyterHub OAuthenticator
  • 9. Use within JupyterHub is easy https://github.com/jupyterhub/oauthenticator#globus-setup
  • 10. Tokens are easily used within notebooks Login REST APIs {“tokens”:… {“tokens”:… REST APIs REST APIs Bearer a45cd… Globus Transfer Globus Search Globus Publish Your App Another App Hub Configurable HTTP proxy Authenticator User database Notebook /hub/ /user/[name]/ Spawner /api/auth Browser
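The token flow on this slide can be sketched as follows. This is a minimal illustration, assuming the notebook already holds a dictionary of tokens. The dictionary layout, the resource-server key, the token value, and the URL are hypothetical placeholders, not the actual format delivered by the Globus OAuth plugin.

```python
import urllib.request

# Hypothetical token dictionary; the real layout handed to the notebook may differ.
tokens = {
    "transfer.example.org": {"access_token": "EXAMPLE-TOKEN"},
}


def bearer_request(url, tokens, resource_server):
    """Build (but do not send) a GET request that carries a bearer token."""
    access_token = tokens[resource_server]["access_token"]
    request = urllib.request.Request(url)
    # REST services that accept OAuth2 tokens read them from this header.
    request.add_header("Authorization", "Bearer " + access_token)
    return request


request = bearer_request(
    "https://transfer.example.org/v1/items",  # placeholder URL
    tokens,
    "transfer.example.org",
)
print(request.get_header("Authorization"))
```

The same header can be attached to requests for any service whose token appears in the dictionary, which is what lets one login reach several REST APIs.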
  • 11. In particular, you can access Globus services Globus Transfer • Uniform access to distributed storage (Posix, S3, Ceph, HPSS, Google Drive, Hadoop, Lustre, …) • HTTPS; GridFTP for high-speed, reliable, third-party transfers • Shared endpoints: User-managed access control • Web, REST, CLI access • HIPAA compliant 12,000 active Globus Connect endpoints (including most universities and labs)
  • 12. In particular, you can access Globus services Globus Search • Cloud-hosted, schema agnostic • Scale to billions of objects Globus Identifiers • Digital object ids for your data • DataCite or other metadata Globus Publication platform • Customized publication pipelines Canadian Federated Research Data Repository: https://frdr.ca/
  • 13. Demonstration What we’re going to do: • Log in to our JupyterCon JupyterHub* • Launch (spawn) a Notebook Server • Get tokens • Access some Globus APIs • Download some data • Plot it • PUT the result on an HTTPS endpoint *Zero to JupyterHub: Fast JupyterHub on Kubernetes https://zero-to-jupyterhub.readthedocs.io
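The last demonstration step, PUT the result on an HTTPS endpoint, might look roughly like the sketch below. It only builds the request and does not send it. The endpoint URL, file name, token value, and helper name are hypothetical and are not taken from the demo itself.

```python
import urllib.request


def build_put_request(url, payload, access_token, content_type="text/csv"):
    """Build (but do not send) an HTTPS PUT request that uploads `payload`."""
    if not url.startswith("https://"):
        raise ValueError("expected an https:// URL")
    request = urllib.request.Request(url, data=payload, method="PUT")
    request.add_header("Authorization", "Bearer " + access_token)
    request.add_header("Content-Type", content_type)
    return request


# Placeholder result data standing in for whatever the notebook produced.
csv_bytes = "x,y\n1,2\n".encode("utf-8")
put_request = build_put_request(
    "https://data.example.org/results/plot-data.csv",  # placeholder endpoint
    csv_bytes,
    "EXAMPLE-TOKEN",  # placeholder token
)
# Sending it would be: urllib.request.urlopen(put_request)
```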
  • 15. The story so far … • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource
  • 16. But wait, there’s more! • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource • Create a containerized data science ecosystem that encompasses laptops, servers, clouds, HPC
  • 17. Container Registry IAM container metadata container recipes ALCF Petrel containers Supercomputer compute compute compute compute JupyterHub Notebook Server Containers are staged to local file systems Users select the container to execute their custom Jupyter environment The same containers can be used for both Jupyter notebook server and compute nodes, for consistency Unified IAM platform scalable for distributed projects Container definitions are tracked in version control systems A registry for container discovery and referencing Containers can be used for other tasks: analysis; ML; etc. Containers everywhere
  • 18. And more … • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource • Create a containerized data science ecosystem that encompasses laptops, servers, clouds, HPC • Incorporate seamless parallel computing via Parsl
  • 19. Python parallel library • Tasks exposed as functions (Python or bash) • Python code to glue functions together • Globus for auth and data movement (Data) science applications require: • Interactivity • Scalability - Need more than a desktop • Reproducibility - Publish code and documentation Our solution: JupyterHub + Parsl • Interactive computing environment • Notebooks for publication • Can run on dedicated hardware parsl-project.org Interactive, scalable, reproducible data analysis @python_app def compute_features(chunk): for f in featurizers: chunk = f.featurize_dataframe(chunk, 'atoms') return chunk chunks = [compute_features(chunk) for chunk in np.array_split(data, chunks)]
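The split, apply, and gather pattern in the slide's code can be illustrated without Parsl installed. The sketch below swaps in the standard library's `concurrent.futures` as a stand-in for Parsl's `@python_app`; the `split` helper and the squaring "featurizer" are hypothetical placeholders for `np.array_split` and the real featurizers.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, n_chunks):
    """Split a list into n_chunks nearly equal, contiguous pieces."""
    size, extra = divmod(len(items), n_chunks)
    pieces, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        pieces.append(items[start:end])
        start = end
    return pieces


def compute_features(chunk):
    """Stand-in for the featurization step: square each value."""
    return [value * value for value in chunk]


data = list(range(10))
with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a future immediately, much as a Parsl app call does.
    futures = [pool.submit(compute_features, chunk) for chunk in split(data, 4)]
    # result() blocks until the corresponding chunk has been processed.
    results = [future.result() for future in futures]

# Flatten the per-chunk results back into one list, preserving order.
features = [value for chunk in results for value in chunk]
print(features)
```

With Parsl, the function would instead be decorated as an app and the same list comprehension would collect its futures, so the notebook code stays the same shape whether it runs on a laptop or a cluster.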
  • 21. And more … • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource • Create a containerized data science ecosystem that encompasses laptops, servers, clouds, HPC • Incorporate seamless parallel computing via Parsl • Jupyter notebooks for rules-based automation – Notebooks are triggered by events (e.g., new data available) – Notebooks trigger events (e.g., computation completed)
  • 23. And more … • Globus APIs enable authentication, data access, data movement, data sharing, data search • Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource • Create a containerized data science ecosystem that encompasses laptops, servers, clouds, HPC • Incorporate seamless parallel computing via Parsl • Jupyter notebooks for rules-based automation • Integration with JupyterLab (student summer project)
  • 34. I reported on the work of many talented people: Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner, and others from the Globus team, the University of Chicago, and Argonne National Laboratory. We are grateful to our sponsors. DLHub Globus IMaD Petrel Argonne Leadership Computing Facility At JupyterCon 2018
  • 35. For more information “Globus APIs enable authentication, data access, data movement, data sharing, data search” → See https://docs.globus.org “Can be used in notebooks and in JupyterHub/Lab to access any data anywhere and to secure any resource” → Tutorial: https://jupyter.demo.globus.org Blog: https://www.globus.org/blog/using-globus-jupyter-notebooks “But wait, there’s more!” → Talk to me, or check back in a few months foster@uchicago.edu