GENERAL DISTRIBUTION
Democratizing Data Science on Kubernetes
RICE DATA SCIENCE CONFERENCE - 2018 - Democratizing Data Science on Kubernetes
DATA SCIENCE PRESSURES
EXPLOSIVE GROWTH
in data analytics teams and analytic tools
MULTIPLE TEAMS COMPETING
for use of the same storage and computing resources
CONGESTION
in busy analytic clusters causing frustration and missed SLAs
EMERGING DATAOPS
Agility and enablement gaps between data scientist developers and full-stack developers
NEED: SHARE CODE (PRODUCT) WITH USERS
Jupyter Notebooks: a technology we could use to combine Python code, a GUI, and documentation for sharing with customers. The start of an interactive data science environment.
Red Hat OpenShift PoC at XOM: could this new technology help us create a reproducible and interactive data science environment?
Prize: the team could quickly obtain customer feedback and work in an Agile way, delivering MVPs quickly.
Drawback: how does one avoid the setup/configuration issues and reliably deploy the notebook?
PC Setup
• pip install required
• Anaconda libraries
• Jupyter Notebook, Python 3.x (load onto PC, or set up a server)
• Local admin access
• Access to latest source code
• OS?
• SQL Server
LOCAL PC VS OPENSHIFT PROJECT CONTAINERS
Local PC Setup
• pip install required
• Anaconda libraries
• Jupyter Notebook, Python 3.x (load onto PC, or set up a server)
• Local admin access
• Access to latest source code
• OS?
• SQL Server
OpenShift Project Container (Container v2.0)
Jupyter Notebook, Python 3.x (image)
Libraries
• NumPy
• Pandas
• Matplotlib
• IPyWidgets
• SciPy
• Lmfit
• Seaborn
• Plotly
SQLite
Delivery path: Git (image project, code project), OpenShift, URL to PoC code
Reproducible data science environment that users interact with via Chrome.
Hardware freedom and easier reproduction!
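The library list on this slide can double as a checklist for either setup. A minimal sketch, assuming the import names below (the usual module names for the listed libraries, not names taken from the slides), that reports which modules a given Python environment is missing:

```python
from importlib.util import find_spec

# Assumed import names for the libraries listed on the slide.
REQUIRED_MODULES = [
    "numpy", "pandas", "matplotlib", "ipywidgets",
    "scipy", "lmfit", "seaborn", "plotly", "sqlite3",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Run on a local PC, a check like this tends to surface the setup gaps the slide describes; run inside a container image, it should report the same result for every user of that image.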
REPRODUCIBLE & INTERACTIVE SCIENTIFIC ENVIRONMENT
For a data scientist, the ability to rapidly deploy code and quickly obtain feedback from a user is extremely valuable, and Agile. OpenShift facilitates both.
Agile cycle
1. Understand the problem
2. Suggest solutions, deliver PoC
3. Refine the problem
How to deploy? No worries: supported Kubernetes with OpenShift.
Delivery path: Git (image project, code project), OpenShift, URL to PoC code
“Interactive” feedback!
Moving Forward: ExxonMobil Data Science Capability Today!
As a data scientist, all I care about is that, using OpenShift, I can now deploy a common Jupyter Notebook / Anaconda image (with all required libraries) in a matter of seconds.
This frees me (and other data scientists) to perform data science and not worry about architecture and delivery mechanisms. Now that is democratizing data science!
Selected OpenShift on premises and in the public cloud for Container as a Service (CaaS).
OpenShift supports:
• One-click notebooks and JupyterHub/JupyterLab templates
• Self-service access to data and data science packages
• Nexus Repository for Python, Java, R, PHP, and .NET Core package managers
• Built-in security process for Docker public repository images: protects against rooted containers and new CVE attacks
• NVIDIA GPU support, which allows these resources to be shared across multiple teams
DATA SCIENTIST DEVELOPERS' NEEDS
All developers need
● Choice of architectures
● Choice of programming languages
● Choice of databases and persistence
● Choice of application services
● Choice of development tools
● Choice of build and deploy workflows
Data science additional needs
● Access to GPUs and TPUs
● Access to curated data
● Automated pipelines
● Collaboration with the business
● Access to specific data science languages and toolsets
They don’t want to have to worry about the infrastructure.
YOUR DIFFERENTIATION DEPENDS ON YOUR
ABILITY TO DELIVER INTELLIGENT APPS FASTER
CONTAINERS, KUBERNETES, DEVOPS & DATAOPS ARE KEY INGREDIENTS
Innovation
Culture
Cloud-native
Applications
AI & Machine
Learning
Internet of
Things
Virtual GPU
WHY DO CONTAINERS NEED KUBERNETES?
CONTAINERIZED APPLICATIONS
MANAGE CONTAINERS SECURELY
MANAGE CONTAINERS AT SCALE
INTEGRATE IT OPERATIONS
ENABLE HYBRID CLOUD
REFERENCE ARCHITECTURE FOR ENTERPRISE KUBERNETES
CaaS: best IT ops experience
PaaS: best developer experience
Application Services: Middleware, Service Mesh, Functions, ISV
Cluster Services: Metrics, Chargeback, Registry, Logging
Developer Services: Dev Tools, Automated Builds, CI/CD, IDE
Automated Operations*
Kubernetes
Red Hat Enterprise Linux or Red Hat CoreOS
*coming soon
MODERN DATA ANALYTICS PIPELINE
KEY TERMINOLOGY
DATA GENERATION
INGEST
DATA SCIENCE
MACHINE LEARNING
STREAM PROCESSING
TRANSFORM, MERGE, JOIN
DATA ANALYTICS
• IoT Telemetry
• G&G - Well Logs
• Transactions
• Production
• NiFi
• Kafka
• MQTT
• Presto
• Impala
• SparkSQL
• Notebooks
• TensorFlow
• PyTorch
• Keras
• scikit-learn
• Kafka
• Hadoop
• Spark
• Pandas
• Apache Arrow
• Spark
• Hadoop
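As a rough illustration of how these stages chain together, here is a minimal sketch in plain Python. It stands in for the tools named on the slide rather than using them; the sample records, field names, and function names are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data standing in for a "data generation" source
# (for example IoT telemetry) and a reference table to join against.
TELEMETRY = [
    {"well_id": "W1", "pressure": 101.0},
    {"well_id": "W1", "pressure": 103.0},
    {"well_id": "W2", "pressure": 95.0},
    {"well_id": "W3"},  # incomplete reading, dropped at ingest
]
WELLS = {"W1": {"field": "North"}, "W2": {"field": "South"}}


def ingest(records):
    """Ingest stage: pass records on one at a time, dropping incomplete ones."""
    for record in records:
        if "well_id" in record and "pressure" in record:
            yield record


def join_with_wells(records, wells):
    """Transform/merge/join stage: attach well attributes to each reading."""
    for record in records:
        yield {**record, **wells.get(record["well_id"], {})}


def mean_pressure_by_field(records):
    """Analytics stage: average pressure per field."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.get("field", "unknown")].append(record["pressure"])
    return {field: mean(values) for field, values in grouped.items()}


result = mean_pressure_by_field(join_with_wells(ingest(TELEMETRY), WELLS))
print(result)
```

Each stage is a generator feeding the next, which loosely mirrors how a streaming pipeline hands records from ingest through transformation to analytics. In practice the tools listed above would replace each of these functions.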
OSS DATA SCIENCE PROJECTS
● Kubeflow
  ○ TensorFlow
  ○ Seldon
  ○ JupyterHub
  ○ PyTorch
● radanalytics.io
  ○ Oshinko - Apache Spark Cluster
  ○ source-to-image (S2I)
● KAML-D - Early Stage JupyterLab Plugin
  ○ Data Explorer
  ○ Containerized CURLable data
    ■ Dotmesh, Minio, Ceph
  ○ Data Versioning and Metadata
HOW CAN I GET STARTED?
● OpenShift self-service education - https://learn.openshift.com
● Install Minishift - https://docs.okd.io/latest/minishift/getting-started/installing.html
  ○ macOS - brew cask install minishift
  ○ Manual - https://github.com/minishift/minishift/releases
● Install Jupyter and JupyterHub OpenShift templates
  ○ https://github.com/jupyter-on-openshift/jupyterhub-quickstart
● Review the projects at https://radanalytics.io
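After the install step, a quick check that the command-line tools are reachable can save some confusion. A minimal sketch, assuming the executables are named `minishift` and `oc` (the OpenShift client; neither name is spelled out as a command on the slide beyond the Minishift install line):

```python
import shutil

# Assumed executable names; adjust if your installation differs.
TOOLS = ["minishift", "oc"]


def locate_tools(names):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools(TOOLS).items():
        print(f"{name}: {path or 'not found on PATH'}")
```

This only confirms that the binaries can be found; it does not start a cluster or verify versions.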
