This document discusses building analytical microservices powered by Jupyter kernels. It provides an overview of Jupyter notebooks and their architecture. It then introduces the Jupyter Enterprise Gateway, which allows running Jupyter kernels on a distributed cluster for improved scalability and security. Finally, it demonstrates a use case of a sentiment analysis microservice that leverages PySpark on a Hadoop cluster via Jupyter kernels.
3. About me - Luciano Resende
Data Science Platform Architect – IBM – CODAIT (formerly Spark Technology Center)
• Contributing to open source at the ASF for over 10 years
• Currently contributing to: the Jupyter Notebook ecosystem, Apache Bahir, Apache Spark, and Apache Toree, among other projects in the Apache Spark ecosystem
lresende@apache.org
http://lresende.blogspot.com/
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
4. About me – Kevin Bates
Sr. Software Engineer – IBM – CODAIT (formerly Spark Technology Center)
• Over 30 years developing enterprise-level software
• Working in the Jupyter ecosystem for the last 14 months
kbates4@gmail.com
https://www.linkedin.com/in/kevinbatessoftware
@kbates4
https://github.com/kevin-bates
5. Jupyter Notebooks
Notebooks are interactive computational environments in which you can combine code execution, rich text, mathematics, plots, and rich media.
6. Jupyter Notebook Platform Architecture
• The Notebook UI runs in the browser
• The Notebook Server serves the notebooks
• Kernels interpret/execute cell contents
§ Responsible for code execution
§ Abstract different languages
7. Jupyter Notebooks Architecture
[Diagram: the browser-side JavaScript UI talks to the Notebook Server process, which handles notebook and kernel management. A kernel proxy connects to the IPython kernel process over five sockets (Shell, IOPub, stdin, control, heartbeat); the kernel executes user code against libraries such as sklearn, Spark, and TensorFlow.]
9. Jupyter Messaging Protocol
Available Sockets:
• Shell (requests, history, info)
• IOPub (status, display, results)
• Stdin (input requests from kernel)
• Control (shutdown, interrupt)
• Heartbeat (poll)
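A request on the Shell socket is a JSON message whose shape the messaging protocol fixes. A minimal sketch of building an `execute_request` body (fields abbreviated; the helper name is illustrative, not part of any library):

```python
import uuid
from datetime import datetime, timezone

def make_execute_request(code):
    """Build a minimal execute_request message body, per the
    Jupyter messaging protocol (version 5.x)."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,          # unique per message
            "msg_type": "execute_request",       # routed to the Shell socket
            "version": "5.3",
            "date": datetime.now(timezone.utc).isoformat(),
        },
        "parent_header": {},                      # empty for a fresh request
        "metadata": {},
        "content": {"code": code, "silent": False, "store_history": True},
    }

msg = make_execute_request("1+1")
```

Replies on IOPub and Shell carry this request's `header` as their `parent_header`, which is how a client correlates responses with requests.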
10. Message flow
Two types of responses
• Results: computations that return a value
• 1+1
• val a = 2 + 5
• Stream Content: values written to the output stream
• println("Hello World")
• df.show(10)
Client/kernel exchange for evaluating '1+1' (msg_id=1):
Evaluate → Busy → Status (ok/error) → Result and/or Stream Content → Idle
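The two response types arrive as distinct IOPub message types: `execute_result` for results and `stream` for stream content. A sketch of separating them on the client side (`collect_output` is a hypothetical helper; the message shapes follow the messaging protocol):

```python
def collect_output(messages):
    """Split IOPub traffic into result payloads and stream content."""
    results, streams = [], []
    for msg in messages:
        if msg["msg_type"] == "execute_result":
            # Results carry a MIME bundle; text/plain is always present
            results.append(msg["content"]["data"].get("text/plain"))
        elif msg["msg_type"] == "stream":
            # Stream content is raw stdout/stderr text
            streams.append(msg["content"]["text"])
    return results, streams

# Example IOPub traffic for `1+1` followed by `print('Hello World')`:
iopub = [
    {"msg_type": "status", "content": {"execution_state": "busy"}},
    {"msg_type": "execute_result", "content": {"data": {"text/plain": "2"}}},
    {"msg_type": "stream", "content": {"name": "stdout", "text": "Hello World\n"}},
    {"msg_type": "status", "content": {"execution_state": "idle"}},
]
results, streams = collect_output(iopub)
# results == ["2"], streams == ["Hello World\n"]
```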
12. Jupyter Notebooks Architecture
(Diagram repeated from slide 7.)
13. The Origins of Jupyter Enterprise Gateway
• Multiple IBM products embedding Spark on YARN
• All wanted to add Jupyter notebooks with Spark
• Usual enterprise requirements (multitenancy, scalability, security, etc.)
• Attempts at scaling up (one large server) or running a single Notebook server per user were insufficient
• Jupyter Kernel Gateway introduced a Bring Your Own Notebook model via a WebSocket "personality" and the nb2kg notebook extension
14. Initial prototype using Jupyter Kernel Gateway
[Diagram: notebook clients use the nb2kg proxy extension to reach a Jupyter Kernel Gateway running on a single gateway node. Python kernels and their Spark drivers all run on that node, communicating over the five kernel sockets (Shell, IOPub, stdin, control, heartbeat) and submitting work to Spark executors on YARN workers, behind a security layer and the YARN Resource Manager.]
Issue #1: All kernels and Spark drivers run on a single node
Issue #2: All Spark jobs run as the same user ID
15. Issue: All kernels run on a single node
[Chart: maximum number of simultaneous kernels (4 GB heap each) versus cluster size. The count stays at 8 whether the cluster has 4, 8, 12, or 16 nodes of 32 GB each, because every kernel runs on the single gateway node.]
16. Jupyter Enterprise Gateway: Initial Goals
Optimized Resource Allocation
§ Run Spark in YARN Cluster Mode to better utilize cluster resources
§ Pluggable architecture for additional Resource Managers and Lifecycle Management
§ General framework for remote kernels
Multiuser support with user impersonation
§ Enhance security and sandboxing by enabling user impersonation when running kernels (using Kerberos)
§ Individual HDFS home folder for each notebook user
§ Enables use of the same user ID for both notebook and batch jobs
Enhanced Security
§ Secure socket communications
§ Any network communication should be encrypted
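In Enterprise Gateway, the pluggable lifecycle management is driven by the kernelspec: a process-proxy entry in `kernel.json` tells the gateway how to launch and track the remote kernel. An abridged sketch of a YARN-cluster-mode kernelspec (paths and display name are illustrative):

```json
{
  "display_name": "Spark - Python (YARN Cluster Mode)",
  "language": "python",
  "metadata": {
    "process_proxy": {
      "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
    }
  },
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client",
    "SPARK_OPTS": "--master yarn --deploy-mode cluster"
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh",
    "--RemoteProcessProxy.kernel-id", "{kernel_id}"
  ]
}
```

The process proxy replaces the default local-process launch: the gateway asks YARN where the kernel's container landed and proxies the five kernel sockets back to the notebook client.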
[Stack: Jupyter Enterprise Gateway sits on top of Jupyter Kernel Gateway, Jupyter Notebook, and jupyter_client.]
21. Use Case – Sentiment Analysis
Utilizing the Yelp dataset from Kaggle
Utilizing the AFINN sentiment analysis library in Python
Using PySpark for training and scoring the model
Using Jupyter kernels to integrate the microservice with Spark
Yelp Dataset: https://www.kaggle.com/yelp-dataset/yelp-dataset
22. Use Case – Sentiment Analysis
Integrates analytics into your regular Python application/microservice by leveraging Jupyter notebook kernels.
[Diagram: a Flask application exposes a Sentiment Resource, backed by a Sentiment Provider that uses a Kernel Launcher/Client to reach Jupyter Enterprise Gateway over HTTP on the gateway node. The gateway provides multitenancy plus remote kernel and kernel-lifecycle management, launching each Jupyter kernel and its Spark driver in a YARN container, with Spark executors on the YARN workers.]
24. Application – Sentiment REST API
• Leverages Flask-RESTful
• Exposes a sentiment REST API
• Request sentiment for a given business
• http://<host>:5000/sentiment/<business_id>
• During Application startup
• Start the kernel
• Perform required data load operations
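A minimal sketch of the REST surface described above. The talk uses Flask-RESTful; plain Flask is shown here for brevity, and `SentimentProvider` is a hypothetical stand-in for the kernel-backed provider on the next slide:

```python
from flask import Flask, jsonify

class SentimentProvider:
    """Placeholder: the real provider delegates scoring to a remote
    Jupyter kernel (started once, at application startup)."""
    def sentiment_for(self, business_id):
        return {"business_id": business_id, "sentiment": 0.0}

app = Flask(__name__)
provider = SentimentProvider()  # kernel start + data load happen here

@app.route("/sentiment/<business_id>")
def sentiment(business_id):
    # GET /sentiment/<business_id> returns the computed sentiment as JSON
    return jsonify(provider.sentiment_for(business_id))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Starting the kernel once at application startup keeps per-request latency low: each request only sends code to an already-warm kernel instead of paying the Spark driver startup cost.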
25. Application – Sentiment Provider
• Encapsulates all interactions with the kernel
• Start/stop the kernel
• Load necessary tables
• Retrieve business details
• Calculate sentiment for each business review
• Possible enhancements/to-dos
• Return data as NumPy arrays
• Provide more flexibility to manipulate display on the resource side (e.g., pretty HTML)
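The provider pattern boils down to sending code strings to the remote kernel and parsing what the kernel prints. A hedged sketch, where `run_on_kernel` is a hypothetical callable (code string in, captured stdout out) that a real kernel client would supply, and `score` is assumed to be defined inside the kernel:

```python
import json

class SentimentProvider:
    """Encapsulates kernel interactions behind plain Python methods."""
    def __init__(self, run_on_kernel):
        self.run = run_on_kernel  # injected so the class stays testable

    def load_tables(self):
        # Executed once at startup, *inside* the remote kernel
        self.run("reviews = spark.read.json('yelp_review.json')")

    def sentiment_for(self, business_id):
        # Ask the kernel to print JSON we can parse on the client side
        out = self.run(
            "import json; print(json.dumps(score(%r)))" % business_id)
        return json.loads(out)
```

Printing JSON from the kernel and parsing it client-side is the simplest transport; the "Return data as NumPy arrays" to-do above would replace this with a richer serialization.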
26. Application – Kernel Launcher
• Encapsulates the kernel lifecycle
• Start/stop the kernel
• Instantiate a new kernel object
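A sketch of such a launcher built on `jupyter_client` (actually starting a kernel additionally requires `ipykernel`, or a gateway-aware kernel manager when going through Enterprise Gateway; the class and method names here are illustrative):

```python
from jupyter_client import KernelManager

class KernelLauncher:
    """Encapsulates the kernel lifecycle: start, build a client, stop."""
    def __init__(self, kernel_name="python3"):
        self.manager = KernelManager(kernel_name=kernel_name)

    def start(self):
        self.manager.start_kernel()
        client = self.manager.client()     # blocking client by default
        client.start_channels()            # open the five kernel sockets
        client.wait_for_ready(timeout=30)  # block until the kernel answers
        return client

    def stop(self):
        self.manager.shutdown_kernel(now=True)
```

The application would call `start()` once at startup, hand the returned client to the sentiment provider, and call `stop()` on shutdown.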
29. Further Readings
Connecting directly to the kernel
• See Jupyter Client https://github.com/jupyter/jupyter_client
• See https://github.com/lresende/toree-gateway/blob/master/python/toree_client.py
Jupyter Kernel Gateway/Enterprise Gateway HTTP Personality
• Starts the Gateway in single-notebook mode
• Notebook cells (identified by a resource-identifier comment) become accessible via URLs
# GET /hello/world
print("I'm cell #1")
• http://jupyter-kernel-gateway.readthedocs.io/en/latest/http-mode.html