IBM SparkTechnology Center
Big Analytics Meetup
Building Enterprise/Cloud Analytics Platform
with Jupyter Notebooks and Apache Spark
Luciano Resende
IBM | Spark Technology Center
IBM SparkTechnology Center
About Me
Luciano Resende (lresende AT apache DOT org)
• Architect and community liaison at IBM – Spark Technology Center
• Have been contributing to open source at ASF for over 10 years
• Currently contributing to : Jupyter Notebook ecosystem, Apache Bahir, Apache
Spark, Apache Toree among other projects related to Apache Spark ecosystem
2
@lresende1975 http://lresende.blogspot.com/ https://www.linkedin.com/in/lresendehttp://slideshare.net/luckbr1975lresende
IBM SparkTechnology Center
IBM and Open Source
Open Source Community Leadership
Spark	Technology	
Center
Founding	Partner 188+	Project	Committers 77+	Projects
Key	Open	source	
steering	committee	
OSS	Advisory	Board
Open	Source
IBM SparkTechnology Center
The Spark Technology Center
IBM SparkTechnology Center
IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business applications — http://bigdatauniversity.com
Key statistics:
About 40 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now a top level Apache project !
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
6
IBM SparkTechnology Center
Focus on meaningful code contributions across
all major Spark projects
863 code contributions (JIRAs) and counting –
Check out http://jiras.spark.tc
Over 422 commits in Spark 2.0 , and
continuing major contributions in 2.x
Contributions by the Spark Technology Center
across almost all components of Spark
— Spark Core, SparkR, SQL, MLlib,
Streaming, PySpark, build and infrastructure,
etc
STC	impact	on	community	
46,385	Spark	LOC
863	Spark	JIRAs
457	SystemML JIRAs
67	Speakers	at	Events
Apache Spark Contributions
IBM SparkTechnology Center
Project Focus Areas
SQL
TPC-DS and Performance
Query Pushdown/Federation
Machine Learning
Spark MLLib
R4ML
Online Retraining
Apache Arrow
SystemML
Deep Learning
Consumability
Reference architectures
Spark Notebook stack
Spark Resource optimization
Spark Web UI
Apache Bahir
RedRock
Immersive Insights
IBM SparkTechnology Center
Spark Notebook Stack
based on Jupyter Notebooks
IBM SparkTechnology Center
Jupyter Notebook Platform Architecture Overview
• Notebook UI runs on the browser
• The Notebook Server serves the ’Notebooks’
• Kernels interpret/execute cell contents
• Are responsible for code execution
• Abstracts different languages
10
IBM SparkTechnology Center
Enterprise/Cloud Analytics Platform Characteristics
Large pool of shared computing resources
• Enterprise Cloud, Public Cloud or Hybrid
• Data in the cloud (Data Lakes/Object Store)
Distributed Consumers
• Notebooks running local (users laptop) or as a service
Different Resource Utilization Patterns
• High number of idle resources
11
IBM SparkTechnology Center
Analytics Platform – Current state of the art
Open Source Jupyter based Notebook Platform
• Single User sharing the same distributed filesystem and privileges
• Jupyter Kernels running as local process
• Resources are limited by what is available on the one single node that runs all Kernels and associated Spark drivers.
• No security, users can see and control each others process using Jupyter’s administration
utilities.
12
IBM SparkTechnology Center
Analytics Platform Today – Shared Cluster
Allows Jupyter notebooks running outside of the
cluster to run Jupyter kernels inside the cluster
sharing it’s resources.
• All Jupyter kernels run under a shared, “service” user ID.
• Users can see and control each others’ kernels using
Jupyter’s administration utilities.
• All kernels and their associated Spark drivers run on a
single (configurable) node of the cluster.
13
Spark Cluster
Bob’s Desktop
Multiple Notebooks
Alice’s Desktop
Multiple Notebooks
Jupyter Kernel Gateway
(Sandboxed by service user privileges)
Jupyter
Kernel
Gateway
Jupyter
Notebook
Server
(with NB2KG)
Executors
(as Alice)Executors
(as Alice)Spark Executors
(as JNBG Service User)
Executors
(as Alice)Executors
(as Alice)Spark Executors
(as JNBG Service User)
Kernel
[Spark Driver]
(yarn-client mode
as JNBG Service
User)
YARN
Workers
Security
Layer
Jupyter
Notebook
Server
(with NB2KG)
Kernel
[Spark Driver]
(yarn-client mode
as JNBG Service
User)
IBM SparkTechnology Center
Analytics Platform Today – Single User Cluster
Allows Jupyter notebooks running outside of the
cluster to run Jupyter kernels in a cluster created
specially to the user.
• Expensive as clusters are created for every individual
user
14
Spark Cluster
Bob’s Desktop
Multiple Notebooks
Jupyter Kernel Gateway
(Sandboxed by service user privileges)
Jupyter
Kernel
Gateway
Jupyter
Notebook
Server
(with NB2KG)
Executors
(as Alice)Executors
(as Alice)Spark Executors
(as JNBG Service User)
Kernel
[Spark Driver]
(yarn-client mode
as JNBG Service
User)
YARN
Workers
Spark Cluster
Alice’s Desktop
Multiple Notebooks
Jupyter Kernel Gateway
(Sandboxed by service user privileges)
Jupyter
Kernel
Gateway
Executors
(as Alice)Executors
(as Alice)Spark Executors
(as JNBG Service User)
Kernel
[Spark Driver]
(yarn-client mode
as JNBG Service
User)
YARN
Workers
Jupyter
Notebook
Server
(with NB2KG)
IBM SparkTechnology Center
Extended Jupyter Kernel Gateway
Notebook Platform based on Jupyter stack aiming on Enterprise/Cloud
requirements and use cases
15
IBM SparkTechnology Center
Extended Jupyter Kernel Gateway – Goals
Optimized Resource Allocation
•Run Spark in YARN Cluster Mode to better utilize cluster resources.
•Pluggable architecture for additional Resource Managers
Enhanced Security
•Enable TLS for all socket communications
•Any HTTP communication should be encrypted (SSL)
Multiuser support with user impersonation
•Enhance security and sandboxing by enabling user impersonation when running kernels.
•Individual HDFS home folder for each notebook user.
•Use the same user ID for notebook and batch jobs.
16
IBM SparkTechnology Center
Extended Jupyter Kernel Gateway
Extending Jupyter Kernel Gateway
• Enable running kernels remotely in a cluster
• Pluggable kernel lifecycle management
• Enhanced security
• Multiuser leveraging user impersonation
17
Extended Jupyter Kernel Gateway
Jupyter Kernel Gateway
Jupyter Notebook Server
IBM SparkTechnology Center
Spark Cluster
Extended Jupyter Kernel Gateway
18
Security
Layer
Alice’s Desktop
Multiple Notebooks
Jupyter
Notebook
Server
(with NB2KG)
YARN
Workers
Extended Jupyter Kernel Gateway
Jupyter REST API
User Session Manager
Kernel Lyfecycle
Kernel Communication (local/remote)
Spark Executors
Spark Executors
Spark Executors
Yarn Container
Jupyter Kernel
Spark Driver
Spark Executors
Spark Executors
Spark Executors
Yarn Container
Jupyter Kernel
Spark Driver
Spark Executors
Spark Executors
Spark Executors
Yarn Container
Jupyter Kernel
Spark Driver
Bob’s Desktop
Multiple Notebooks
Jupyter
Notebook
Server
(with NB2KG)
Impersonation:
Alice’s kernel
runs under Alice’s
user ID.
IBM SparkTechnology Center
Extended Jupyter Kernel Gateway
Stay tuned, we are becoming open source very soon!!!
Are you considering being an early adopter, please contact me at
lresende AT us DOT ibm DOT com !!!
19
IBM SparkTechnology Center
Other Resources
IBM developerWorks Code IBM developerWorks Journeys
https://developer.ibm.com/code/ https://developer.ibm.com/code/journey/
20

Big analytics meetup - Extended Jupyter Kernel Gateway

  • 1.
    IBM SparkTechnology Center BigAnalytics Meetup Building Enterprise/Cloud Analytics Platform with Jupyter Notebooks and Apache Spark Luciano Resende IBM | Spark Technology Center
  • 2.
    IBM SparkTechnology Center AboutMe Luciano Resende (lresende AT apache DOT org) • Architect and community liaison at IBM – Spark Technology Center • Have been contributing to open source at ASF for over 10 years • Currently contributing to : Jupyter Notebook ecosystem, Apache Bahir, Apache Spark, Apache Toree among other projects related to Apache Spark ecosystem 2 @lresende1975 http://lresende.blogspot.com/ https://www.linkedin.com/in/lresendehttp://slideshare.net/luckbr1975lresende
  • 3.
  • 4.
    Open Source CommunityLeadership Spark Technology Center Founding Partner 188+ Project Committers 77+ Projects Key Open source steering committee OSS Advisory Board Open Source
  • 5.
    IBM SparkTechnology Center TheSpark Technology Center
  • 6.
    IBM SparkTechnology Center IBMSpark Technology Center Founded in 2015. Location: Physical: 505 Howard St., San Francisco CA Web: http://spark.tc Twitter: @apachespark_tc Mission: Contribute intellectual and technical capital to the Apache Spark community. Make the core technology enterprise- and cloud-ready. Build data science skills to drive intelligence into business applications — http://bigdatauniversity.com Key statistics: About 40 developers, co-located with 25 IBM designers. Major contributions to Apache Spark http://jiras.spark.tc Apache SystemML is now a top level Apache project ! Founding member of UC Berkeley AMPLab and RISE Lab Member of R Consortium and Scala Center 6
  • 7.
    IBM SparkTechnology Center Focuson meaningful code contributions across all major Spark projects 863 code contributions (JIRAs) and counting – Check out http://jiras.spark.tc Over 422 commits in Spark 2.0 , and continuing major contributions in 2.x Contributions by the Spark Technology Center across almost all components of Spark — Spark Core, SparkR, SQL, MLlib, Streaming, PySpark, build and infrastructure, etc STC impact on community 46,385 Spark LOC 863 Spark JIRAs 457 SystemML JIRAs 67 Speakers at Events Apache Spark Contributions
  • 8.
    IBM SparkTechnology Center ProjectFocus Areas SQL TPC-DS and Performance Query Pushdown/Federation Machine Learning Spark MLLib R4ML Online Retraining Apache Arrow SystemML Deep Learning Consumability Reference architectures Spark Notebook stack Spark Resource optimization Spark Web UI Apache Bahir RedRock Immersive Insights
  • 9.
    IBM SparkTechnology Center SparkNotebook Stack based on Jupyter Notebooks
  • 10.
    IBM SparkTechnology Center JupyterNotebook Platform Architecture Overview • Notebook UI runs on the browser • The Notebook Server serves the ’Notebooks’ • Kernels interpret/execute cell contents • Are responsible for code execution • Abstracts different languages 10
  • 11.
    IBM SparkTechnology Center Enterprise/CloudAnalytics Platform Characteristics Large pool of shared computing resources • Enterprise Cloud, Public Cloud or Hybrid • Data in the cloud (Data Lakes/Object Store) Distributed Consumers • Notebooks running local (users laptop) or as a service Different Resource Utilization Patterns • High number of idle resources 11
  • 12.
    IBM SparkTechnology Center AnalyticsPlatform – Current state of the art Open Source Jupyter based Notebook Platform • Single User sharing the same distributed filesystem and privileges • Jupyter Kernels running as local process • Resources are limited by what is available on the one single node that runs all Kernels and associated Spark drivers. • No security, users can see and control each others process using Jupyter’s administration utilities. 12
  • 13.
    IBM SparkTechnology Center AnalyticsPlatform Today – Shared Cluster Allows Jupyter notebooks running outside of the cluster to run Jupyter kernels inside the cluster sharing it’s resources. • All Jupyter kernels run under a shared, “service” user ID. • Users can see and control each others’ kernels using Jupyter’s administration utilities. • All kernels and their associated Spark drivers run on a single (configurable) node of the cluster. 13 Spark Cluster Bob’s Desktop Multiple Notebooks Alice’s Desktop Multiple Notebooks Jupyter Kernel Gateway (Sandboxed by service user privileges) Jupyter Kernel Gateway Jupyter Notebook Server (with NB2KG) Executors (as Alice)Executors (as Alice)Spark Executors (as JNBG Service User) Executors (as Alice)Executors (as Alice)Spark Executors (as JNBG Service User) Kernel [Spark Driver] (yarn-client mode as JNBG Service User) YARN Workers Security Layer Jupyter Notebook Server (with NB2KG) Kernel [Spark Driver] (yarn-client mode as JNBG Service User)
  • 14.
    IBM SparkTechnology Center AnalyticsPlatform Today – Single User Cluster Allows Jupyter notebooks running outside of the cluster to run Jupyter kernels in a cluster created specially to the user. • Expensive as clusters are created for every individual user 14 Spark Cluster Bob’s Desktop Multiple Notebooks Jupyter Kernel Gateway (Sandboxed by service user privileges) Jupyter Kernel Gateway Jupyter Notebook Server (with NB2KG) Executors (as Alice)Executors (as Alice)Spark Executors (as JNBG Service User) Kernel [Spark Driver] (yarn-client mode as JNBG Service User) YARN Workers Spark Cluster Alice’s Desktop Multiple Notebooks Jupyter Kernel Gateway (Sandboxed by service user privileges) Jupyter Kernel Gateway Executors (as Alice)Executors (as Alice)Spark Executors (as JNBG Service User) Kernel [Spark Driver] (yarn-client mode as JNBG Service User) YARN Workers Jupyter Notebook Server (with NB2KG)
  • 15.
    IBM SparkTechnology Center ExtendedJupyter Kernel Gateway Notebook Platform based on Jupyter stack aiming on Enterprise/Cloud requirements and use cases 15
  • 16.
    IBM SparkTechnology Center ExtendedJupyter Kernel Gateway – Goals Optimized Resource Allocation •Run Spark in YARN Cluster Mode to better utilize cluster resources. •Pluggable architecture for additional Resource Managers Enhanced Security •Enable TLS for all socket communications •Any HTTP communication should be encrypted (SSL) Multiuser support with user impersonation •Enhance security and sandboxing by enabling user impersonation when running kernels. •Individual HDFS home folder for each notebook user. •Use the same user ID for notebook and batch jobs. 16
  • 17.
    IBM SparkTechnology Center ExtendedJupyter Kernel Gateway Extending Jupyter Kernel Gateway • Enable running kernels remotely in a cluster • Pluggable kernel lifecycle management • Enhanced security • Multiuser leveraging user impersonation 17 Extended Jupyter Kernel Gateway Jupyter Kernel Gateway Jupyter Notebook Server
  • 18.
    IBM SparkTechnology Center SparkCluster Extended Jupyter Kernel Gateway 18 Security Layer Alice’s Desktop Multiple Notebooks Jupyter Notebook Server (with NB2KG) YARN Workers Extended Jupyter Kernel Gateway Jupyter REST API User Session Manager Kernel Lyfecycle Kernel Communication (local/remote) Spark Executors Spark Executors Spark Executors Yarn Container Jupyter Kernel Spark Driver Spark Executors Spark Executors Spark Executors Yarn Container Jupyter Kernel Spark Driver Spark Executors Spark Executors Spark Executors Yarn Container Jupyter Kernel Spark Driver Bob’s Desktop Multiple Notebooks Jupyter Notebook Server (with NB2KG) Impersonation: Alice’s kernel runs under Alice’s user ID.
  • 19.
    IBM SparkTechnology Center ExtendedJupyter Kernel Gateway Stay tuned, we are becoming open source very soon!!! Are you considering being an early adopter, please contact me at lresende AT us DOT ibm DOT com !!! 19
  • 20.
    IBM SparkTechnology Center OtherResources IBM developerWorks Code IBM developerWorks Journeys https://developer.ibm.com/code/ https://developer.ibm.com/code/journey/ 20