Big analytics meetup - Extended Jupyter Kernel Gateway

IBM SparkTechnology Center
Big Analytics Meetup
Building Enterprise/Cloud Analytics Platform
with Jupyter Notebooks and Apache Spark
Luciano Resende
IBM | Spark Technology Center

About Me
Luciano Resende (lresende AT apache DOT org)
• Architect and community liaison at IBM – Spark Technology Center
• Have been contributing to open source at ASF for over 10 years
• Currently contributing to : Jupyter Notebook ecosystem, Apache Bahir, Apache
Spark, Apache Toree among other projects related to Apache Spark ecosystem
2
@lresende1975 http://lresende.blogspot.com/ https://www.linkedin.com/in/lresendehttp://slideshare.net/luckbr1975lresende

IBM and Open Source

Open Source Community Leadership
Spark Technology
Center
Founding Partner 188+ Project Committers 77+ Projects
Key Open source
steering committee
OSS Advisory Board
Open Source

The Spark Technology Center

IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business applications — http://bigdatauniversity.com
Key statistics:
About 40 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now a top level Apache project !
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
6

Focus on meaningful code contributions across
all major Spark projects
863 code contributions (JIRAs) and counting –
Check out http://jiras.spark.tc
Over 422 commits in Spark 2.0 , and
continuing major contributions in 2.x
Contributions by the Spark Technology Center
across almost all components of Spark
— Spark Core, SparkR, SQL, MLlib,
Streaming, PySpark, build and infrastructure,
etc
STC impact on community
46,385 Spark LOC
863 Spark JIRAs
457 SystemML JIRAs
67 Speakers at Events
Apache Spark Contributions

Project Focus Areas
SQL
TPC-DS and Performance
Query Pushdown/Federation
Machine Learning
Spark MLLib
R4ML
Online Retraining
Apache Arrow
SystemML
Deep Learning
Consumability
Reference architectures
Spark Notebook stack
Spark Resource optimization
Spark Web UI
Apache Bahir
RedRock
Immersive Insights

Spark Notebook Stack
based on Jupyter Notebooks

Jupyter Notebook Platform Architecture Overview
• Notebook UI runs on the browser
• The Notebook Server serves the ’Notebooks’
• Kernels interpret/execute cell contents
• Are responsible for code execution
• Abstracts different languages
10

Enterprise/Cloud Analytics Platform Characteristics
Large pool of shared computing resources
• Enterprise Cloud, Public Cloud or Hybrid
• Data in the cloud (Data Lakes/Object Store)
Distributed Consumers
• Notebooks running local (users laptop) or as a service
Different Resource Utilization Patterns
• High number of idle resources
11

Analytics Platform – Current state of the art
Open Source Jupyter based Notebook Platform
• Single User sharing the same distributed filesystem and privileges
• Jupyter Kernels running as local process
• Resources are limited by what is available on the one single node that runs all Kernels and associated Spark drivers.
• No security, users can see and control each others process using Jupyter’s administration
utilities.
12

Analytics Platform Today – Shared Cluster
Allows Jupyter notebooks running outside of the
cluster to run Jupyter kernels inside the cluster
sharing it’s resources.
• All Jupyter kernels run under a shared, “service” user ID.
• Users can see and control each others’ kernels using
Jupyter’s administration utilities.
• All kernels and their associated Spark drivers run on a
single (configurable) node of the cluster.
13
Spark Cluster
Bob’s Desktop
Multiple Notebooks
Alice’s Desktop
Multiple Notebooks
Jupyter Kernel Gateway
(Sandboxed by service user privileges)
Jupyter
Kernel
Gateway
Jupyter
Notebook
Server
(with NB2KG)
Executors
(as Alice)Executors
(as Alice)Spark Executors
(as JNBG Service User)
Executors
(as Alice)Executors
Kernel
[Spark Driver]
(yarn-client mode
as JNBG Service
User)
YARN
Workers
Security
Layer
Jupyter
Notebook
Server
(with NB2KG)
Kernel
[Spark Driver]
(yarn-client mode
as JNBG Service
User)

Analytics Platform Today – Single User Cluster
Allows Jupyter notebooks running outside of the
cluster to run Jupyter kernels in a cluster created
specially to the user.
• Expensive as clusters are created for every individual
user
14
Spark Cluster
Bob’s Desktop
Multiple Notebooks
Jupyter
Kernel
Gateway
Jupyter
Notebook
Server
(with NB2KG)
Executors
(as Alice)Executors
Kernel
[Spark Driver]
(yarn-client mode
as JNBG Service
User)
YARN
Workers
Spark Cluster
Alice’s Desktop
Multiple Notebooks
Jupyter
Kernel
Gateway
Executors
(as Alice)Executors
Kernel
[Spark Driver]
(yarn-client mode
as JNBG Service
User)
YARN
Workers
Jupyter
Notebook
Server
(with NB2KG)

Extended Jupyter Kernel Gateway
Notebook Platform based on Jupyter stack aiming on Enterprise/Cloud
requirements and use cases
15

Extended Jupyter Kernel Gateway – Goals
Optimized Resource Allocation
•Run Spark in YARN Cluster Mode to better utilize cluster resources.
•Pluggable architecture for additional Resource Managers
Enhanced Security
•Enable TLS for all socket communications
•Any HTTP communication should be encrypted (SSL)
Multiuser support with user impersonation
•Enhance security and sandboxing by enabling user impersonation when running kernels.
•Individual HDFS home folder for each notebook user.
•Use the same user ID for notebook and batch jobs.
16

Extending Jupyter Kernel Gateway
• Enable running kernels remotely in a cluster
• Pluggable kernel lifecycle management
• Enhanced security
• Multiuser leveraging user impersonation
17
Jupyter Notebook Server

Spark Cluster
18
Security
Layer
Alice’s Desktop
Multiple Notebooks
Jupyter
Notebook
Server
(with NB2KG)
YARN
Workers
Jupyter REST API
User Session Manager
Kernel Lyfecycle
Kernel Communication (local/remote)
Spark Executors
Spark Executors
Spark Executors
Yarn Container
Jupyter Kernel
Spark Driver
Spark Executors
Spark Executors
Spark Executors
Yarn Container
Jupyter Kernel
Spark Driver
Spark Executors
Spark Executors
Spark Executors
Yarn Container
Jupyter Kernel
Spark Driver
Bob’s Desktop
Multiple Notebooks
Jupyter
Notebook
Server
(with NB2KG)
Impersonation:
Alice’s kernel
runs under Alice’s
user ID.

Stay tuned, we are becoming open source very soon!!!
Are you considering being an early adopter, please contact me at
lresende AT us DOT ibm DOT com !!!
19

Other Resources
IBM developerWorks Code IBM developerWorks Journeys
https://developer.ibm.com/code/ https://developer.ibm.com/code/journey/
20

Big analytics meetup - Extended Jupyter Kernel Gateway

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Big analytics meetup - Extended Jupyter Kernel Gateway

Similar to Big analytics meetup - Extended Jupyter Kernel Gateway (20)

More from Luciano Resende

More from Luciano Resende (20)

Recently uploaded

Recently uploaded (20)

Big analytics meetup - Extended Jupyter Kernel Gateway