Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cluster - Codemotion Berlin 2018

Scaling Big Data
Interactive Workloads
across Kubernetes Cluster
Luciano Resende
Codemotion Berlin - 2018
© 2018 IBM Corporation

About me - Luciano Resende
Open Source AI Platform Architect – IBM – CODAIT
• Senior Technical Staff Member at IBM, contributing to open source for over 10 years
• Currently contributing to the Jupyter Notebook ecosystem, Apache Bahir, Apache
Toree and Apache Spark, among other projects related to AI/ML platforms
lresende@us.ibm.com
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende

IBM Open Source Participation
Open Source @ IBM
Learn
• Program touches 78,000 IBMers annually
Consume
• Virtually all IBM products contain some open source
• 40,363 pkgs per year
Contribute
• >62K OS Certs per year
• ~10K IBM commits per month
Connect
• >1000 active IBM Contributors working in key OS projects

IBM Open Source Participation
IBM generated open source innovation
• 137 Code Open (dWO) projects w/1000+ GitHub projects
• 4 graduates to full open governance in the last year: Node-RED, OpenWhisk,
SystemML, Blockchain Fabric
• developer.ibm.com/code/open/code/
Community
• IBM focused on 18 strategic communities
• Drive open governance in “Centers of Gravity”
• IBM Leaders drive key technologies and assure freedom
of action
The IBM OS Way is now open sourced
• Training, Recognition, Tooling
• Organization, Consuming, Contributing

Center for Open Source
Data and AI Technologies
CODAIT
codait.org
codait (French)
= coder/coded
https://m.interglot.com/fr/en/codait
CODAIT aims to make AI solutions
dramatically easier to create, deploy,
and manage in the enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission

Interactive
Development with
Jupyter Notebooks

Jupyter Notebooks
Notebooks are interactive computational environments in which you can combine
code execution, rich text, mathematics, plots and rich media.

Jupyter Notebooks
• Notebook UI runs in the browser
• The Notebook Server serves the 'Notebooks'
• Kernels interpret/execute cell contents
– Are responsible for code execution
– Abstract different languages
– Have a 1:1 relationship with a Notebook
– Run and consume resources as long as the notebook is running
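The kernel bullets above can be illustrated with a toy sketch. This is not how a real Jupyter kernel is implemented; `ToyKernel` is a hypothetical class that only shows why a kernel keeps state (and therefore memory) for as long as its notebook is running.

```python
import contextlib
import io


class ToyKernel:
    """Toy stand-in for a kernel: one long-lived namespace per notebook."""

    def __init__(self):
        # State lives here for as long as the kernel is running,
        # which is why an idle kernel still holds memory.
        self.namespace = {}
        self.execution_count = 0

    def execute(self, code):
        """Run one cell's source and return whatever it printed."""
        self.execution_count += 1
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


kernel = ToyKernel()          # one kernel per notebook (1:1)
kernel.execute("x = 21")      # cell 1 defines a variable
print(kernel.execute("print(x * 2)"), end="")  # cell 2 still sees it
```

A real kernel runs as a separate process and talks to the Notebook Server over a messaging protocol; the shared namespace is the part this sketch keeps.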

Analytics and Deep Learning
Workloads

Analytics Workloads
Large amount of data
Shared across organization in Data Lakes
Multiple workload types
- Data cleansing
- Data Warehouse
- ML and Insights

Deep Learning Workloads
Resource Intensive workloads
Requires expensive hardware (GPU, TPU)
Long Running training jobs
- A simple MNIST model takes over one hour to train WITHOUT a decent GPU
- Training other, not especially complex deep learning models can easily
take over a day WITH GPUs

Local Development Environment

Development Environment Evolution
Python
Environments
Anaconda …

Analytic and AI Platforms
Large pool of shared computing
resources
• Enterprise Cloud, Public Cloud or Hybrid
• Shared Data (Data Lakes/Object Storage)
Distributed Consumers
• Notebooks running locally (user's laptop)
or as a service (e.g. JupyterHub)
Different Resource Utilization Patterns
• High number of idle resources

Limitations of Jupyter Notebook Stack
[Diagram: lifecycle stages Gather Data, Analyze Data, Machine Learning, Deep Learning, Deploy Model, Maintain Model; tools shown: Python Data Science Stack, Pandas, Scikit-Learn, Jupyter, Apache Spark, Keras + TensorFlow, Fabric for Deep Learning (FfDL), MLeap + PFA, Model Asset eXchange]
[Chart: Maximum number of simultaneous kernels (4GB heap) by cluster size (32GB nodes): 4 nodes: 8, 8 nodes: 8, 12 nodes: 8, 16 nodes: 8]
• Scalability
• Jupyter Kernels running as local process
• Resources are limited by what is available
on the one single node that runs all Kernels
and associated Spark drivers
• Security
• All kernels run as a single user, sharing the same privileges
• Users can see and control each other's processes
using Jupyter administrative utilities

Jupyter Enterprise Gateway

Jupyter Enterprise
Gateway
Jupyter Enterprise Gateway at IBM Code
https://developer.ibm.com/code/openprojects/jupyter-enterprise-gateway/
Jupyter Enterprise Gateway source code at GitHub
https://github.com/jupyter-incubator/enterprise_gateway
Jupyter Enterprise Gateway Documentation
http://jupyter-enterprise-gateway.readthedocs.io/en/latest/
Supported Kernels
Supported Platforms
A lightweight, multi-tenant, scalable
and secure gateway that enables
Jupyter Notebooks to share resources
across an Apache Spark or Kubernetes
cluster for Enterprise/Cloud use cases
Spectrum Conductor
+

Jupyter Enterprise Gateway Features
[Diagram: lifecycle stages Gather Data, Analyze Data, Machine Learning, Deep Learning, Deploy Model, Maintain Model; tools shown: Python Data Science Stack, Pandas, Scikit-Learn, Jupyter, Apache Spark, Keras + TensorFlow, Fabric for Deep Learning (FfDL), MLeap + PFA, Model Asset eXchange]
[Chart: Maximum number of simultaneous kernels (4GB heap) by cluster size (32GB nodes): 4 nodes: 16, 8 nodes: 32, 12 nodes: 48, 16 nodes: 64]
Optimized Resource Allocation
– Utilize resources on all cluster nodes by running kernels as Spark
applications in YARN Cluster Mode.
– Pluggable architecture to enable support for additional Resource Managers
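A small worked example of the arithmetic behind the two "maximum number of simultaneous kernels" charts in this deck. The node size (32GB) and kernel heap (4GB) come from the chart labels; `kernels_per_node=4` is read off the Enterprise Gateway chart (16 kernels on 4 nodes), and the slides do not state why it is 4 rather than 8, so treat that parameter as an assumption.

```python
NODE_MEMORY_GB = 32   # from the chart's x-axis label (32GB nodes)
KERNEL_HEAP_GB = 4    # from the chart's y-axis label (4GB heap)


def max_kernels_single_node():
    """All kernels share one node, so cluster size does not matter."""
    return NODE_MEMORY_GB // KERNEL_HEAP_GB


def max_kernels_distributed(nodes, kernels_per_node=4):
    """Kernels spread over the cluster; capacity grows with node count.

    kernels_per_node=4 is an assumption read off the chart, not a
    figure explained in the slides.
    """
    return nodes * kernels_per_node


for nodes in (4, 8, 12, 16):
    print(nodes, max_kernels_single_node(), max_kernels_distributed(nodes))
```

The single-node figure stays at 8 for every cluster size, while the distributed figure reproduces the 16, 32, 48, 64 series.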
Enhanced Security
– End-to-End secure communications
• Secure socket communications
• Encrypted HTTP communication using SSL
Multiuser support with user impersonation
– Enhance security and sandboxing by enabling user impersonation when
running kernels (using Kerberos).
– Individual HDFS home folder for each notebook user.
– Use the same user ID for notebook and batch jobs.
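A minimal sketch of what launching a kernel as a YARN cluster-mode Spark application with user impersonation could look like. `--master`, `--deploy-mode` and `--proxy-user` are standard `spark-submit` options; whether Enterprise Gateway builds its command exactly this way is not stated in the slides, and `build_spark_submit`, `launch_kernel.py` and the user name are hypothetical.

```python
import shlex


def build_spark_submit(app_path, user=None):
    """Return an argv list that submits an app in YARN cluster mode."""
    argv = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    if user:
        # --proxy-user asks Spark to run the application as this user.
        argv += ["--proxy-user", user]
    argv.append(app_path)
    return argv


# Hypothetical launcher script and user name, for illustration only.
cmd = build_spark_submit("launch_kernel.py", user="alice")
print(shlex.join(cmd))
```

Running the kernel as the notebook user is what makes a per-user HDFS home folder and a shared user ID for notebook and batch jobs possible.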

Jupyter Notebooks
and Kubernetes

Development Environment Evolution
Python
Environments
Anaconda Analytics Platform …

Jupyter & Kubernetes
Kubernetes Platform
- Containers provide a flexible way to deploy applications and are here to stay
- Containers simplify management of complicated and heterogeneous AI/Deep
Learning infrastructure
- Kubernetes enables easy management of containerized applications and
resources, with the benefits of elasticity and quality of service
Source: https://github.com/Langhalsdino/Kubernetes-GPU-Guide

Enterprise Gateway & Kubernetes
Before Jupyter Enterprise Gateway …
• Resources required for all kernels need to
be allocated during Notebook Server pod
creation
• Resources limited to what is physically
available on the host node that runs all
kernels and associated Spark drivers
After Jupyter Enterprise Gateway …
• Gateway pod is very lightweight
• Each kernel runs in its own pod, which provides isolation
• Kernel pods built from community images:
Spark-on-K8s, TensorFlow, Keras, etc.
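The before/after contrast can be put into numbers with a small sketch. All sizes below are hypothetical; the point is only that the "before" model reserves for the maximum number of kernels at pod creation, while the "after" model reserves per running kernel pod.

```python
def reserved_before(max_kernels, kernel_gb, server_gb):
    """Notebook Server pod must reserve room for every kernel up front."""
    return server_gb + max_kernels * kernel_gb


def reserved_after(active_kernels, kernel_gb, gateway_gb):
    """Lightweight gateway pod plus one pod per kernel actually running."""
    return gateway_gb + active_kernels * kernel_gb


# Hypothetical sizing: 10 possible kernels at 4 GB each, 2 of them in use.
print(reserved_before(max_kernels=10, kernel_gb=4, server_gb=1))    # 41
print(reserved_after(active_kernels=2, kernel_gb=4, gateway_gb=1))  # 9
```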

Jupyter Enterprise Gateway - Kubernetes
Container images defined in kernelspec
[Diagram labels: nb2kg, Gateway, Vanilla Kernels (Kernel, community image), Spark based kernels (Kernel, Spark on K8s), Distributed File System]

Jupyter & Kubernetes
• Multi-user Enterprise Gateway pod
• Each kernel launched on its own pod
• Kernel pod namespace is configurable
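To make "each kernel on its own pod, in a configurable namespace" concrete, here is a sketch that builds a minimal Kubernetes Pod manifest as a Python dict. The field names follow the core `v1` Pod schema; the function name, labels, user name and namespace are hypothetical, and this is not the manifest Enterprise Gateway itself generates. The image name is the one shown in the kernelspec example in this deck.

```python
import json
import uuid


def kernel_pod_manifest(username, image, namespace="default"):
    """Build a minimal Pod manifest for one kernel (illustrative only)."""
    kernel_id = str(uuid.uuid4())
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{username}-{kernel_id}",
            "namespace": namespace,
            "labels": {"kernel_id": kernel_id, "app": "kernel"},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [{"name": "kernel", "image": image}],
        },
    }


pod = kernel_pod_manifest("alice", "elyra/kubernetes-kernel-py:dev", "team-a")
print(json.dumps(pod, indent=2))
```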

Enabling remote kernels
Jupyter Kernels are configured by kernelspecs
• Each kernel has a corresponding kernelspec
• Stored in one of the Jupyter data paths
• $ jupyter kernelspec list
/…/anaconda3/share/jupyter/kernels/python2/kernel.json
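A rough sketch of what kernelspec discovery amounts to: each kernel is a directory under `kernels/` in a Jupyter data path, holding a `kernel.json`. The helper `find_kernelspecs` is hypothetical and only mimics the idea behind `jupyter kernelspec list`; it builds a throwaway directory so it runs without Jupyter installed.

```python
import json
import tempfile
from pathlib import Path


def find_kernelspecs(data_dir):
    """Map kernel name -> parsed kernel.json under <data_dir>/kernels/."""
    specs = {}
    for spec_file in sorted(Path(data_dir).glob("kernels/*/kernel.json")):
        specs[spec_file.parent.name] = json.loads(spec_file.read_text())
    return specs


# Build a throwaway data directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as data_dir:
    kernel_dir = Path(data_dir) / "kernels" / "python2"
    kernel_dir.mkdir(parents=True)
    (kernel_dir / "kernel.json").write_text(json.dumps({
        "display_name": "Python 2",
        "language": "python",
        "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    }))
    specs = find_kernelspecs(data_dir)
    print({name: spec["display_name"] for name, spec in specs.items()})
```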

Process Proxy:
• Abstracts kernel process represented by Jupyter
framework
• Pluggable class definition identified in kernelspec
(kernel.json)
• Manages kernel lifecycle
Kernel Launcher:
• Embeds target kernel
• Listens on gateway communication port
• Conveys interrupt requests (via local signal)
• Could be extended for additional communications
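The "pluggable class identified in the kernelspec" idea can be sketched as follows: read a dotted class name, import it dynamically, and drive the kernel through a small lifecycle interface. `BaseProcessProxy`, `LocalDemoProxy`, `load_class` and the `demo_proxies` module are all hypothetical names for illustration; the real Enterprise Gateway classes differ in detail.

```python
import importlib
import subprocess
import sys
import types
from abc import ABC, abstractmethod


class BaseProcessProxy(ABC):
    """Hypothetical lifecycle interface for a kernel process."""

    @abstractmethod
    def launch(self, argv): ...

    @abstractmethod
    def poll(self): ...

    @abstractmethod
    def kill(self): ...


class LocalDemoProxy(BaseProcessProxy):
    """Runs the 'kernel' as a local child process, for illustration."""

    def launch(self, argv):
        self.process = subprocess.Popen(argv)

    def poll(self):
        return self.process.poll()  # None while the process is alive

    def kill(self):
        self.process.kill()
        self.process.wait()


def load_class(dotted_name):
    """Import 'package.module.ClassName' and return the class object."""
    module_name, _, class_name = dotted_name.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)


# Register the demo class under an importable name so the lookup works
# without any extra files on disk.
demo_module = types.ModuleType("demo_proxies")
demo_module.LocalDemoProxy = LocalDemoProxy
sys.modules["demo_proxies"] = demo_module

kernelspec = {"process_proxy": {"class_name": "demo_proxies.LocalDemoProxy"}}
proxy = load_class(kernelspec["process_proxy"]["class_name"])()
proxy.launch([sys.executable, "-c", "import time; time.sleep(30)"])
print("running:", proxy.poll() is None)
proxy.kill()
print("exit code after kill:", proxy.poll())
```

Swapping the class name in the kernelspec is then enough to change where and how the kernel process runs.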
{
  "language": "python",
  "display_name": "Spark - Python (Kubernetes Mode)",
  "process_proxy": {
    "class_name": "enterprise_gateway.services.processproxies.k8s.KubernetesProcessProxy",
    "config": {
      "image_name": "elyra/kubernetes-kernel-py:dev",
      "executor_image_name": "elyra/kubernetes-kernel-py:dev",
      "port_range": "40000..42000"
    }
  },
  "env": {
    "SPARK_HOME": "/opt/spark",
    "SPARK_OPTS": "--master k8s://https://${KUBERNETES_SERVICE_HOST --deploy-mode cluster --name …",
    …
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_kubernetes/bin/run.sh",
    "{connection_file}",
    "--RemoteProcessProxy.response-address",
    "{response_address}",
    "--RemoteProcessProxy.spark-context-initialization-mode",
    "lazy"
  ]
}
Enabling remote kernels
Process Proxies mixed with Kernel Launchers
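Two details of the kernelspec shown on this slide lend themselves to a short sketch: the `port_range` string and the `{connection_file}` / `{response_address}` placeholders in `argv`. The helpers below are hypothetical and only show one plausible way to interpret those values; the file path and address are made up.

```python
def parse_port_range(text):
    """Turn 'low..high' into a (low, high) tuple of ints."""
    low, high = (int(part) for part in text.split(".."))
    if low > high:
        raise ValueError(f"invalid port range: {text}")
    return low, high


def fill_argv(argv, **values):
    """Substitute {placeholder} tokens in argv; leave other items as-is."""
    return [item.format(**values) if "{" in item else item for item in argv]


kernelspec = {
    "process_proxy": {"config": {"port_range": "40000..42000"}},
    "argv": [
        "run.sh",
        "{connection_file}",
        "--RemoteProcessProxy.response-address",
        "{response_address}",
    ],
}

print(parse_port_range(kernelspec["process_proxy"]["config"]["port_range"]))
print(fill_argv(
    kernelspec["argv"],
    connection_file="/tmp/kernel-1234.json",  # hypothetical path
    response_address="10.0.0.5:8877",         # hypothetical address
))
```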

Jupyter Enterprise Gateway Components
[Diagram: component layers, as labelled]
Jupyter Notebook: NB2KG Extension, Lab Extension
Jupyter Enterprise Gateway: Remote Kernel Manager
Process proxies: Distributed, YARN Cluster, Kubernetes, Conductor Cluster
Jupyter Kernel Gateway
Jupyter Notebook
Supported runtime platforms: Spectrum Conductor, FfDL

Jupyter Notebooks
and Deep Learning Platforms

Deep Learning Platforms
Prohibitive costs
- Deep Learning resources are too expensive to be left locked/idle during
interactive development
Deep Learning Platforms
- We have seen the rise of Deep Learning
platforms that leverage containers and
Kubernetes as the basis of their
infrastructure
- Kubernetes enables Deep Learning
platforms to easily share and restrict
accelerated hardware
Fabric for Deep Learning
IBM Watson Studio
Deep Learning as a service
Batch-oriented development

Deep Learning Workspace
Streamline the Data Science user experience when coming from
Notebook/Interactive development interfaces
• The current process includes multiple steps, one being decomposing the
notebook into an application that must be submitted as a zip to the deep
learning runtime, which is a show stopper for data scientists adopting FfDL
and DLaaS
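The manual step described above (turning a notebook into an application and zipping it) can be sketched with the standard library. The notebook dict follows the nbformat 4 layout (`cells`, `cell_type`, `source`); the function names and `train.py` are hypothetical, and the archive layout a given deep learning runtime expects is not specified in the slides.

```python
import io
import json
import zipfile


def notebook_to_script(notebook):
    """Concatenate the code cells of an nbformat-4 style notebook dict."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # nbformat allows source as a string or a list of lines.
            chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n\n".join(chunks) + "\n"


def package_job(notebook, script_name="train.py"):
    """Return zip bytes containing the extracted script."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(script_name, notebook_to_script(notebook))
    return buffer.getvalue()


notebook = {
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Experiment"]},
        {"cell_type": "code", "source": ["epochs = 3\n", "print(epochs)"]},
    ],
}
payload = package_job(notebook)
print(zipfile.ZipFile(io.BytesIO(payload)).namelist())
```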

Streamline the Deep Learning application
lifecycle
• Run local notebook experiments, with small data
samples and seamlessly validate experiments on
Deep Learning environments
• IBM Cloud DLaaS, FfDL (open source), KubeFlow
(open source)
Simplify productionalization of Model training
and serving from Notebooks
• Enable running/scheduling notebooks on
production environments as batch jobs
• Results can be made available via an updated
notebook, or exported to HTML, PDF and a few
other formats.
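One common way to run a notebook as a batch job and export the result is the `jupyter nbconvert` command with `--execute`. The sketch below only builds the command line (so it runs without Jupyter installed); the function name and `experiment.ipynb` are hypothetical, and the slides do not say which tool the author's platform uses.

```python
def nbconvert_command(notebook_path, output_format="html"):
    """Build a 'jupyter nbconvert' argv that executes and exports a notebook."""
    return ["jupyter", "nbconvert", "--to", output_format, "--execute", notebook_path]


for fmt in ("notebook", "html", "pdf"):
    print(" ".join(nbconvert_command("experiment.ipynb", fmt)))
```

On a machine with Jupyter installed, the list can be passed to `subprocess.run(cmd, check=True)`; PDF export additionally needs a LaTeX toolchain.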
Interactive development
lifecycle done on
commodity hardware
with sampled data
Training on full dataset
gets scheduled as
batch jobs on deep
learning infrastructure

• User selects where to run the
experiment
• Job is packaged and submitted on
behalf of the user
• User has access to a Job Console to
monitor the experiment

Thank you!
@lresende1975
