Democratizing Data Science on Kubernetes

GENERAL DISTRIBUTION

RICE DATA SCIENCE CONFERENCE - 2018 - Democratizing Data Science on Kubernetes

DATA SCIENCE PRESSURES
EXPLOSIVE GROWTH
in data analytics teams and analytic tools
MULTIPLE TEAMS COMPETING
for use of the same storage and computing resources
CONGESTION
in busy analytic clusters causing frustration and missed SLAs
EMERGING DATAOPS
Agility and enablement gaps between Data Scientist Developers and Full Stack Developers

NEED: SHARE CODE (PRODUCT) WITH USERS
Jupyter Notebooks are a technology we could use to combine Python code, a GUI, and documentation for sharing with customers.
This was the start of an Interactive Data Science environment.
Red Hat OpenShift PoC at XOM: could this new technology help us create a Reproducible & Interactive Data Science environment?
Prize: the team could quickly obtain customer feedback and easily apply Agile methodology, and so deliver MVPs quickly.
Drawback: how does one avoid the setup/configuration issues and reliably deploy the notebook?
PC Setup
• pip install required Anaconda libraries
• Jupyter Notebook, Python 3.x (load onto PC, or set up a server)
• Local admin access
• Access to latest source code
• OS?
• SQL Server

LOCAL PC VS OPENSHIFT PROJECT CONTAINERS
OpenShift Project Container (Container v2.0)
Jupyter Notebook, Python 3.x (image)
Libraries
• NumPy
• pandas
• Matplotlib
• ipywidgets
• SciPy
• lmfit
• seaborn
• Plotly
SQLite
GIT: image project and code project
OpenShift: URL to PoC code
Local PC Setup
• pip install required Anaconda libraries
• Jupyter Notebook, Python 3.x (load onto PC, or set up a server)
• Local admin access
• Access to latest source code
• OS?
• SQL Server
Result: a Reproducible Data Science environment that users interact with via Chrome. Hardware freedom and easier reproduction!
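One way to confirm that an environment, whether a local PC or a container image, actually carries the libraries listed on this slide is a small import check. The following is a minimal sketch, not part of the original deck: the function name `missing_packages` is hypothetical, and the import names are assumed to match the libraries as listed.

```python
import importlib.util

# Import names assumed for the libraries listed on the slide.
REQUIRED = ["numpy", "pandas", "matplotlib", "ipywidgets",
            "scipy", "lmfit", "seaborn", "plotly"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report what the current interpreter can and cannot import.
missing = missing_packages(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All listed libraries are available.")
```

Running the same check on a PC and inside the notebook container gives a quick view of whether the two environments match.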

REPRODUCIBLE & INTERACTIVE SCIENTIFIC ENVIRONMENT
For a Data Scientist, the ability to rapidly deploy code and quickly obtain feedback from a user is extremely valuable, and Agile! OpenShift facilitates these capabilities!
Agile
1. Understand the Problem
2. Suggest Solutions
Deliver PoC
3. Refine the Problem
How to deploy? No worries: supported Kubernetes with OpenShift
GIT (image project, code project)
OpenShift
URL to PoC code
“Interactive” feedback!

Moving Forward: ExxonMobil Data Science Capability today!
As a Data Scientist, all I care about is that, using OpenShift, I can now deploy a common Jupyter Notebook / Anaconda image (with all required libraries) in a matter of seconds.
This frees me (and other Data Scientists) to perform data science and not worry about architecture and delivery mechanisms. Now that is Democratizing Data Science!
Selected OpenShift on premises and in the public cloud for Container as a Service (CaaS)
OpenShift supports:
• One-click notebooks and JupyterHub/Lab templates
• Self-service access to data and data science packages
• Nexus Repository serving Python, Java, R, PHP, and .NET Core package managers
• A built-in security process for Docker public repository images that protects against rooted containers and new CVE attacks
• NVIDIA GPU support that allows these resources to be shared across multiple teams
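For the Nexus Repository item, self-service package access in a notebook usually comes down to pointing pip at the internal index. The following is a minimal sketch under stated assumptions: the host `nexus.example.com` and the repository name `pypi-proxy` are hypothetical placeholders, and the helper `pip_conf_for_index` is invented for illustration. Only the `[global]` section and `index-url` key are standard pip configuration.

```python
import configparser
import io


def pip_conf_for_index(index_url):
    """Render pip.conf text that points pip at a single package index."""
    config = configparser.ConfigParser()
    config["global"] = {"index-url": index_url}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical Nexus host and repository name; replace with real values.
NEXUS_PYPI = "https://nexus.example.com/repository/pypi-proxy/simple"

conf_text = pip_conf_for_index(NEXUS_PYPI)
print(conf_text)
```

Baking a file like this into the shared notebook image would let every user install packages from the same curated source without extra setup.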

DATA SCIENTIST DEVELOPER NEEDS
All Developers need
● Choice of architectures
● Choice of programming languages
● Choice of databases and persistence
● Choice of application services
● Choice of development tools
● Choice of build and deploy workflows
Data Science Additional Needs
● Access to GPUs and TPUs
● Access to Curated Data
● Automated pipelines
● Collaboration with the Business
● Access to specific data science languages and toolsets
They don’t want to have to worry about the infrastructure.

YOUR DIFFERENTIATION DEPENDS ON YOUR ABILITY TO DELIVER INTELLIGENT APPS FASTER
CONTAINERS, KUBERNETES, DEVOPS & DATAOPS ARE KEY INGREDIENTS
• Innovation Culture
• Cloud-native Applications
• AI & Machine Learning
• Internet of Things
• Virtual GPU

WHY DO CONTAINERS NEED KUBERNETES?
CONTAINERIZED APPLICATIONS
MANAGE CONTAINERS SECURELY
MANAGE CONTAINERS AT SCALE
INTEGRATE IT OPERATIONS
ENABLE HYBRID CLOUD

REFERENCE ARCHITECTURE FOR ENTERPRISE KUBERNETES
CaaS: Best IT Ops Experience
PaaS: Best Developer Experience
Application Services: Middleware, Service Mesh, Functions, ISV
Cluster Services: Metrics, Chargeback, Registry, Logging
Developer Services: Dev Tools, Automated Builds, CI/CD, IDE
Automated Operations*
Kubernetes
Red Hat Enterprise Linux or Red Hat CoreOS
*coming soon
MODERN DATA ANALYTICS PIPELINE
KEY TERMINOLOGY
DATA GENERATION
INGEST
DATA SCIENCE
MACHINE LEARNING
STREAM PROCESSING
TRANSFORM, MERGE, JOIN
DATA ANALYTICS
• IoT Telemetry
• G&G - Well Logs
• Transactions
• Production
• NiFi
• Kafka
• MQTT
• Presto
• Impala
• SparkSQL
• Notebooks
• TensorFlow
• PyTorch
• Keras
• scikit-learn
• Kafka
• Hadoop
• Spark
• Pandas
• Apache Arrow
• Spark
• Hadoop
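The stage names on this slide can be read as a chain of small steps, each feeding the next. The following is a minimal sketch in plain Python, assuming hypothetical telemetry lines and well names invented purely for illustration. A real pipeline would use the tools listed above (Kafka, Spark, pandas, and so on) rather than lists and dictionaries.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical reference data, invented for illustration only.
WELLS = {"W1": "North field", "W2": "South field"}


def generate():
    """Data generation: yield raw telemetry as comma-separated strings."""
    yield from ["W1,10.0", "W2,4.0", "W1,14.0", "W2,bad", "W2,6.0"]


def ingest(lines):
    """Ingest: parse raw lines into records, skipping malformed readings."""
    for line in lines:
        well_id, _, value = line.partition(",")
        try:
            yield {"well": well_id, "rate": float(value)}
        except ValueError:
            continue  # drop records whose reading is not numeric


def transform(records, wells):
    """Transform, merge, join: attach the field name for each known well."""
    for rec in records:
        if rec["well"] in wells:
            yield {**rec, "field": wells[rec["well"]]}


def analyze(records):
    """Data analytics: compute the mean rate per field."""
    by_field = defaultdict(list)
    for rec in records:
        by_field[rec["field"]].append(rec["rate"])
    return {field: mean(rates) for field, rates in by_field.items()}


summary = analyze(transform(ingest(generate()), WELLS))
print(summary)
```

Because each stage is a generator, records flow through one at a time, which loosely mirrors how the stream-processing tools on the slide hand data from stage to stage.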

OSS DATA SCIENCE PROJECTS
● Kubeflow
○ TensorFlow
○ Seldon
○ JupyterHub
○ PyTorch
● radanalytics.io
○ Oshinko - Apache Spark Cluster
○ source-to-image (S2I)
● KAML-D - Early Stage JupyterLab Plugin
○ Data Explorer
○ Containerized CURLable data
■ Dotmesh, Minio, Ceph
○ Data Versioning and Metadata

HOW CAN I GET STARTED?
● OpenShift Self Service Education - https://learn.openshift.com
● Install Minishift - https://docs.okd.io/latest/minishift/getting-started/installing.html
○ macOS - brew cask install minishift
○ Manual - https://github.com/minishift/minishift/releases
● Install Jupyter and JupyterHub OpenShift templates
○ https://github.com/jupyter-on-openshift/jupyterhub-quickstart
● Review the projects at https://radanalytics.io
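After the Minishift install step, a quick check that the command-line tools are reachable can save some confusion before trying the templates. The following is a minimal sketch: the helper name `find_tools` is hypothetical, and it assumes the Minishift binary is named `minishift` and the OpenShift client is named `oc`.

```python
import shutil


def find_tools(names=("minishift", "oc")):
    """Map each CLI name to its location on PATH, or None if not installed."""
    return {name: shutil.which(name) for name in names}


# Print where each tool was found, if at all.
for tool, path in find_tools().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```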
