IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale. Several components power this platform, including Jupyter Notebooks, an enterprise gateway that manages the execution of the Jupyter Kernels, and an Apache Spark cluster that powers the computation. In this session we will describe our experience and best practices in putting together this analytical platform as a service based on Jupyter Notebooks and Apache Spark, and in particular how we built the Enterprise Gateway that enables all the Notebooks to share the Spark cluster’s computational resources.
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
1. IBM Spark Technology Center
Big Data Spain – Nov 2017
The Analytic Platform behind IBM’s Watson Data Platform
Luciano Resende
IBM | Spark Technology Center
2. About me - Luciano Resende
Data Science Platform Architect – IBM – Spark Technology Center
• Have been contributing to open source at the ASF for over 10 years
• Currently contributing to the Jupyter Notebook ecosystem, Apache Bahir, Apache Spark, and Apache Toree, among other projects related to the Apache Spark ecosystem
lresende@apache.org
http://lresende.blogspot.com/
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
3. Open Source Community Leadership
Spark Technology Center
• Founding Partner
• 188+ Project Committers
• 77+ Projects
• Key open source steering committee memberships
• OSS Advisory Board
4. IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business applications — http://bigdatauniversity.com
Key statistics:
About 40 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now a top-level Apache project!
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
5. Agenda
IBM Data Science Experience
IBM Analytics Engine
Challenges faced building Analytic Platform
Jupyter Enterprise Gateway
References
6. IBM Data Science Experience (DSX)
Be a better data scientist
IBM Data Science Experience is an environment that brings together everything that a Data Scientist needs to be more productive, including tools, data and content.
7. DSX is built on a foundation of open source, primarily Jupyter notebooks
Notebooks are interactive computational environments, in which you can combine code execution, rich text, mathematics, plots and rich media.
8. Jupyter Notebook Platform Architecture
• The Notebook UI runs in the browser
• The Notebook Server serves the ‘Notebooks’
• Kernels interpret/execute cell contents
• They are responsible for code execution
• They abstract different languages
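To make the split between the Notebook UI, the server and the kernels more concrete, here is a minimal sketch of the kind of `execute_request` message that carries a cell's code to a kernel, following the field layout of the Jupyter messaging protocol. The helper name `make_execute_request`, the user name `alice` and the protocol version string are assumptions chosen for illustration; the message is only built and printed, not sent.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="alice"):
    """Build an execute_request message as a plain dict (illustrative helper)."""
    header = {
        "msg_id": uuid.uuid4().hex,        # unique per message
        "session": session_id,             # identifies the client session
        "username": username,              # hypothetical user name
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.3",                  # assumed protocol version
    }
    content = {
        "code": code,                      # the cell contents to execute
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
        "stop_on_error": True,
    }
    return {
        "header": header,
        "parent_header": {},
        "metadata": {},
        "content": content,
    }


message = make_execute_request("print(1 + 1)", session_id=uuid.uuid4().hex)
print(json.dumps(message, indent=2))
```

Because the kernel only sees messages of this shape, the same UI and server can drive Python, Scala or R kernels without knowing the language behind them.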
11. IBM Analytics Engine - Characteristics
IBM Analytics Engine is built on open source Apache Hadoop and Apache Spark. It provides users flexibility of open source and an opportunity to expand on their existing open source investments.
IBM Analytics Engine helps Data scientists, Data engineers, and Developers to focus on building data models and business solutions while simplifying cluster administration through easy to use interfaces for management and integration.
IBM Analytics Engine deploys clusters in minutes with enterprise-level security, reliability, and powerful integration capabilities for data management, monitoring, and dashboards.
12. Capabilities
Separation of compute and storage
• Scale compute and storage independently for better economics
• Separating compute and storage ensures no data loss in case of cluster failure
• Ease of incorporating patches or upgrades by creating new clusters
• Spin up use-case-specific clusters using different instance sizes for different use cases
• Uniform governance and collaboration through WDP services
Ease of use and administration
• Access and administer through multiple interfaces: Cloud Foundry CLI, REST APIs on public interface, and GUI
• Enhanced flexibility for configuring clusters, including installing 3rd party libraries through bootstrap scripts
• Deploy and scale clusters within minutes, in a few clicks, including propagating libraries and configurations to all nodes of the cluster
13. Capabilities
* Roadmap item
Enhanced reliability and security
• ‘Auto-heal’ capability recovers processes from failure*
• Geo-replicated object store for disaster avoidance
• Encrypted object store, data-at-rest, and data-in-motion encryption* provide enhanced levels of security
Flexibility and innovation of open source
• Built on ODPi compliant Apache Spark and Apache Hadoop stack for portability between open source environments
• Integrate analytics tools using standard, open source libraries and drivers
14. Enterprise/Cloud Analytics Platform Characteristics
Large pool of shared computing resources
• Enterprise Cloud, Public Cloud or Hybrid
• Data in the cloud (Data Lakes/Object Storage)
Distributed Consumers
• Notebooks running locally (on the user’s laptop) or as a service
Different Resource Utilization Patterns
• High number of idle resources
15. Analytics Platform – Current state of the art
Open Source Jupyter based Notebook Platform
• Single User sharing the same distributed filesystem and privileges
• Jupyter Kernels running as local processes
• Resources are limited by what is available on the single node that runs all Kernels and associated Spark drivers.
• No security: users can see and control each other’s processes using Jupyter’s administration utilities.
16. Analytics Platform Today – Shared Cluster
Allows Jupyter notebooks running outside of the cluster to run Jupyter kernels inside the cluster, sharing its resources.
• All Jupyter kernels run under a shared “service” user ID.
• Users can see and control each other’s kernels using Jupyter’s administration utilities.
• All kernels and their associated Spark drivers run on a single (configurable) node of the cluster.
[Diagram: on Bob’s desktop, multiple notebooks run on a Jupyter Notebook Server (with NB2KG), which connects through a security layer to the Jupyter Kernel Gateway (sandboxed by service user privileges) in the Spark cluster. Each kernel hosts its Spark driver in yarn-client mode as the JNBG service user, and the Spark executors run on YARN workers as the JNBG service user.]
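As a rough illustration of how a notebook server outside the cluster can ask a gateway to start a kernel inside it, the sketch below builds the two requests involved in the Jupyter kernels REST API: a `POST` to `/api/kernels` to start a kernel and a WebSocket URL for its channels. The host name `gateway.example.com`, the kernel name `python3` and both helper functions are hypothetical, and nothing is sent over the network.

```python
import json
from urllib.parse import urljoin, urlsplit, urlunsplit
from urllib.request import Request

# Hypothetical gateway address, used only for illustration.
GATEWAY_URL = "http://gateway.example.com:8888/"


def start_kernel_request(gateway_url, kernel_name):
    """Build (but do not send) the request that asks the gateway for a new kernel."""
    body = json.dumps({"name": kernel_name}).encode("utf-8")
    return Request(
        urljoin(gateway_url, "api/kernels"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def channels_url(gateway_url, kernel_id):
    """Return the WebSocket URL used to exchange messages with a running kernel."""
    parts = urlsplit(urljoin(gateway_url, f"api/kernels/{kernel_id}/channels"))
    scheme = "wss" if parts.scheme == "https" else "ws"
    return urlunsplit((scheme, parts.netloc, parts.path, "", ""))


request = start_kernel_request(GATEWAY_URL, "python3")
print(request.get_method(), request.full_url)
print(channels_url(GATEWAY_URL, "example-kernel-id"))
```

In this arrangement the notebook server keeps serving notebooks locally while the kernel process, and with it the Spark driver, lives on the cluster side of the gateway.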
17. Analytics Platform Today – Single User Cluster
Allows Jupyter notebooks running outside of the cluster to run Jupyter kernels in a cluster created specifically for the user.
• Expensive, as a cluster is created for every individual user
[Diagram: on Bob’s desktop, multiple notebooks run on a Jupyter Notebook Server (with NB2KG), which connects to the Jupyter Kernel Gateway (sandboxed by service user privileges) in the Spark cluster. The kernel hosts its Spark driver in yarn-client mode as the JNBG service user, and the Spark executors run on YARN workers as the JNBG service user.]
19. Jupyter Enterprise Gateway
A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across an Apache Spark cluster, targeting Enterprise/Cloud requirements and use cases
20. Jupyter Enterprise Gateway – Goals
Optimized Resource Allocation
• Run Spark in YARN Cluster Mode to better utilize cluster resources.
• Pluggable architecture for additional Resource Managers
Enhanced Security
• Enable TLS for all socket communications
• Any HTTP communication should be encrypted (SSL)
Multiuser support with user impersonation
• Enhance security and sandboxing by enabling user impersonation when running kernels.
• Individual HDFS home folder for each notebook user.
• Use the same user ID for notebook and batch jobs.
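The combination of YARN cluster mode and user impersonation can be pictured with standard `spark-submit` options: `--deploy-mode cluster` places the driver on a cluster node chosen by YARN, and `--proxy-user` submits the application on behalf of another user. The sketch below only assembles such a command line; the script name `launch_kernel.py`, the user `alice` and the helper `spark_submit_argv` are hypothetical, and this is not necessarily how the Enterprise Gateway itself launches kernels.

```python
import shlex


def spark_submit_argv(app, user, deploy_mode="cluster"):
    """Assemble a spark-submit command line for YARN (illustrative helper)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unsupported deploy mode: {deploy_mode}")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,   # "cluster": driver runs inside the cluster
        "--proxy-user", user,           # impersonate the notebook user
        app,                            # hypothetical kernel launcher script
    ]


argv = spark_submit_argv("launch_kernel.py", user="alice")
print(shlex.join(argv))
```

With `client` mode every driver stays on the gateway node, which is the bottleneck described for the shared cluster setup; `cluster` mode spreads the drivers across the cluster. Impersonation through `--proxy-user` also depends on the Hadoop cluster being configured to allow the submitting service account to act for other users.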
21. Jupyter Enterprise Gateway
Supported Platforms
• Python/Spark 2.x using IPython kernel
• With Spark Context delayed initialization
• Scala 2.11/ Spark 2.x using Apache Toree kernel
• With Spark Context delayed initialization
• R / Spark 2.x with IRkernel
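One way to think about “Spark Context delayed initialization” is that the kernel becomes responsive immediately while the expensive context is created in the background, and only the first use of the context waits for it. The sketch below shows that general pattern with a stand-in factory; `LazyContext` and `fake_spark_context` are hypothetical names, no real Spark API is called, and the actual kernels may implement this differently.

```python
import threading
import time


class LazyContext:
    """Create an expensive object in a background thread; block only on first use."""

    def __init__(self, factory):
        self._value = None
        self._error = None
        self._thread = threading.Thread(
            target=self._run, args=(factory,), daemon=True
        )
        self._thread.start()

    def _run(self, factory):
        try:
            self._value = factory()
        except Exception as exc:  # remember the failure and re-raise it in get()
            self._error = exc

    def get(self):
        self._thread.join()       # waits only while creation is still in progress
        if self._error is not None:
            raise self._error
        return self._value


def fake_spark_context():
    """Stand-in for creating a Spark context (assumption, not a Spark call)."""
    time.sleep(0.1)               # simulate slow startup on the cluster
    return {"master": "yarn", "app_name": "notebook-kernel"}


sc = LazyContext(fake_spark_context)
print("kernel ready")             # the kernel can answer before the context exists
print(sc.get()["master"])         # the first real use waits for initialization
```

The benefit is that notebook cells which never touch Spark are not delayed by context startup.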
25. Jupyter Enterprise Gateway – Roadmap
• Kernel Configuration Profile
• Enable client to request different resource configuration for kernels (e.g. small, medium, large)
• Profiles should be defined by Administrators and enabled for user/group of users.
• Administration UI
• Dashboard with running kernels and administration actions
• Time running, stop/kill, Profile Management, etc
• Add support for other resource managers
• User Environments
• High Availability
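The kernel configuration profile item could look roughly like the following: an administrator defines named profiles, each mapped to standard Spark properties, and a client request is checked against the profiles enabled for that user. The profile sizes, the `PROFILES` table and the helper `spark_conf_args` are hypothetical; only the Spark property names (`spark.executor.instances`, `spark.executor.cores`, `spark.executor.memory`) are real.

```python
# Hypothetical administrator-defined profiles mapped to Spark properties.
PROFILES = {
    "small": {
        "spark.executor.instances": "2",
        "spark.executor.cores": "1",
        "spark.executor.memory": "2g",
    },
    "medium": {
        "spark.executor.instances": "4",
        "spark.executor.cores": "2",
        "spark.executor.memory": "4g",
    },
    "large": {
        "spark.executor.instances": "8",
        "spark.executor.cores": "4",
        "spark.executor.memory": "8g",
    },
}


def spark_conf_args(profile_name, allowed_profiles):
    """Turn a requested profile into --conf arguments if the user may use it."""
    if profile_name not in PROFILES:
        raise KeyError(f"unknown profile: {profile_name}")
    if profile_name not in allowed_profiles:
        raise PermissionError(f"profile not enabled for this user: {profile_name}")
    args = []
    for key, value in sorted(PROFILES[profile_name].items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Example: a user who has been granted the small and medium profiles.
print(spark_conf_args("medium", allowed_profiles={"small", "medium"}))
```

Keeping the profile table on the administrator side means clients only name a size, while the resource limits themselves stay under central control.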
26. Jupyter Enterprise Gateway
Jupyter Enterprise Gateway at IBM Code
https://developer.ibm.com/code/openprojects/jupyter-enterprise-gateway/
Jupyter Enterprise Gateway on GitHub
https://github.com/jupyter-incubator/enterprise_gateway
Jupyter Enterprise Gateway Documentation
http://jupyter-enterprise-gateway.readthedocs.io/en/latest/
Jupyter Enterprise Gateway 0.7 release coming out today