Enterprise Data Science Workflows on Kubeflow
Use GitOps to deploy and manage your Kubeflow cluster.
Perform an end-to-end data science workflow on Kubeflow.
Stefano Fioravanzo
Yannis Zarkadas
Arrikto
Simplify. Accelerate. Collaborate. arrik.to/odsc20
GitOps and Multi-Tenancy Combined for an
Enterprise Data Science Experience on Kubeflow
Stefano Fioravanzo Yannis Zarkadas
Software Engineer Software Engineer
● How to deploy and manage Kubeflow in a GitOps manner
● How to make sure you run Kubeflow in a secure way
● How to optimize and build production-ready models faster
Why is this important?
✓ Simplify deployment and management of Kubeflow
✓ Accelerate time to production
✓ Collaborate in a secure and isolated manner
What You’ll Learn In This Session
What is Kubeflow
The Kubeflow project is dedicated to making deployments of
machine learning (ML) workflows on Kubernetes simple,
portable, and scalable.
Perception: ML Products are mostly about ML
Credit: Hidden Technical Debt of Machine Learning Systems, D. Sculley, et al.
In the canonical diagram, the ML Code is just one small box, surrounded by much larger ones: Configuration, Data Collection, Data Verification, Feature Extraction, Process Management Tools, Analysis Tools, Machine Resource Management, Serving Infrastructure, and Monitoring.
Reality: ML Requires DevOps; lots of it
The same diagram, drawn to scale: the ML Code box shrinks to a sliver next to Configuration, Data Collection, Data Verification, Feature Extraction, Process Management Tools, Analysis Tools, Machine Resource Management, Serving Infrastructure, and Monitoring.
Credit: Hidden Technical Debt of Machine Learning Systems, D. Sculley, et al.
Kubeflow components
● Jupyter Notebooks
● Workflow Building: Kale, Fairing, TFX, Airflow, +
● Pipelines: KF Pipelines, HP Tuning, Tensorboard
● Serving: KFServing, Seldon Core, TFServing, +
● Training Operators: TensorFlow, PyTorch, XGBoost, +
● Metadata
● Data Management: Versioning, Reproducibility, Secure Sharing
● Monitoring: Prometheus
● Platforms / clouds: GCP, AWS, Azure, IBM Cloud, OpenShift, on prem
● Kubernetes and scaffolding: Istio, Argo, Prometheus, Spartakus
● ML tools: TensorFlow, PyTorch, scikit-learn, XGBoost, Jupyter, Chainer, MPI, MXNet
● Kubeflow applications: Jupyter notebook web app and controller, hyperparameter tuning (Katib), Kale, Pipelines, Metadata, Kubeflow UI, KFServing, TensorFlow Serving, PyTorch Serving, Seldon Core, training operators (MPI, MXNet, PyTorch, TFJob, XGBoost)
ML workflow
1. Identify problem and collect and analyse data: Jupyter Notebook
2. Choose an ML algorithm and code your model: TensorFlow, scikit-learn, PyTorch, XGBoost
3. Experiment with data and model training: Jupyter Notebook, Kale, Pipelines
4. Tune the model hyperparameters: Katib
5. Serve the model for online/batch prediction: KFServing, TFServing, Seldon Core, NVIDIA TensorRT, PyTorch Serving
Testimonials
● Dyson: “Kubeflow is to data science what a lab notebook is to biomedical
scientists — a way to expedite ideas from the lab to the ‘bedside’ 3x faster, while
ensuring experimental reproducibility.”
● US Bank: “The Kubeflow 1.0 release is a significant milestone as it positions
Kubeflow to be a viable ML Enterprise platform. Kubeflow 1.0 delivers material
productivity enhancements for ML researchers.”
● One Technologies: “With Kubeflow at the heart of our ML platform, our small
company has been able to stack models in production to improve CR, find new
customers, and present the right product to the right customer at the right time.”
Testimonials
● GroupBy: “Kubeflow is helping GroupBy in standardizing ML workflows and
simplifying very complicated deployments!”
● Volvo Cars: “Kubeflow provides a seamless interface to a great set of tools
that together manages the complexity of ML workflows and encourages best
practices. The Data Science and Machine Learning teams at Volvo Cars are
able to iterate and deliver reproducible, production grade services with ease.”
Kubeflow - The Infra Side
● Install
● Manage
● Secure
● Upgrade
What is GitOps
All configuration state is declaratively stored in git.
Imperative vs Declarative
Imperative
1. Create Service
2. Update LoadBalancer
3. Upgrade Deployment
Imperative vs Declarative
Declarative
Desired State (YAML):

apiVersion: v1
kind: Pod
metadata:
  name: mysql
spec:
  containers:
  - name: mysql
    image: mysql:7.6

kubectl apply sends the desired state to Kubernetes, which stores it in etcd.
Controller Pattern - The driver behind declarative APIs
Used everywhere in Kubernetes
A controller watches Kubernetes objects, each with a Spec (desired state) and a Status (real state). It loops: Observe the objects, Calculate the difference between desired and real state, and Reconcile by writing to the physical resources.
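The observe/calculate/reconcile loop can be sketched in a few lines of Python; the state shapes and action names below are illustrative, not the Kubernetes API:

```python
# Illustrative sketch of one reconciliation pass: compare desired state (Spec)
# with real state (Status) and compute the writes needed to converge them.
def reconcile(spec: dict, status: dict) -> list:
    actions = []
    desired = spec.get("replicas", 0)
    real = status.get("replicas", 0)
    if real < desired:
        actions += ["create-pod"] * (desired - real)
    elif real > desired:
        actions += ["delete-pod"] * (real - desired)
    return actions

# A controller runs this in a loop: observe, calculate, reconcile, repeat.
print(reconcile({"replicas": 3}, {"replicas": 1}))  # ['create-pod', 'create-pod']
```

Real controllers follow the same shape: the diff is recomputed on every pass, so the system converges even if a previous write failed.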
Why GitOps?
Reproducibility

commit 856df4gdf56g4561d1fg564df5g61v6854df
Author: yanniszark <yanniszark@arrikto.com>
Date: Tuesday, Sep 8 11:24:12 2020 +0200

    Upgrade MySQL to new version.

Applying the same commit to any cluster (K8s + etcd) reproduces the same configuration state.
● Whole configuration state in git, versioned by commits
● Careful! Mutable state still outside of git (e.g., volumes, S3)
○ Need versioning solution for end-to-end reproducibility
○ Arrikto Rok produces data commits for your volumes (e.g., MySQL)
Rollbacks

git log

commit 856df4gdf56g4561d1fg564df5g61v6854df
Author: yanniszark <yanniszark@arrikto.com>
Date: Tuesday, Sep 8 11:24:12 2020 +0200

    Upgrade MySQL to new version.

commit er1f1ef8f1e1rf5641sdfs564d1fsd1f5sd61fgwd
Author: yanniszark <yanniszark@arrikto.com>
Date: Tuesday, Sep 4 15:24:12 2020 +0200

    Increase MySQL read-replicas to 3 for higher availability.

If applying the latest commit leaves the cluster unhealthy, apply the previous commit to roll back.
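Rolling back is just re-applying an older commit; a toy model of that idea (the history list and manifests stand in for git and kubectl):

```python
# Toy model of GitOps rollback: the cluster converges to whatever commit was
# applied last, so rolling back is simply applying the previous commit.
history = []  # stand-in for git log, oldest first

def commit(manifests: dict) -> None:
    history.append(manifests)

def apply(index: int) -> dict:
    return history[index]  # stand-in for `kubectl apply` of that commit

commit({"mysql": "read-replicas: 3"})
commit({"mysql": "new-version"})     # this one turns out to be unhealthy
cluster = apply(-2)                  # roll back: re-apply the previous commit
print(cluster)  # {'mysql': 'read-replicas: 3'}
```

Because git keeps the full history, no extra rollback machinery is needed beyond applying an earlier revision.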
Auditing
git blame
48f078b0 (Yannis Zarkadas 2020-06-11 41) kind: Deployment
48f078b0 (Yannis Zarkadas 2020-06-11 42) metadata:
48f078b0 (Yannis Zarkadas 2020-06-11 43) name: nginx
48f078b0 (Yannis Zarkadas 2020-06-11 46) spec:
48f078b0 (Stefano Fioravanzo 2020-06-11 47) replicas: 1
Rich Ecosystem
● Collaboration through familiar and battle-tested tools
○ Pull Requests and Code Reviews
● Rich offerings
○ GitHub, GitLab, etc.
● Plenty of integrations
○ GitHub Actions, GitLab Pipelines, etc.
Reuse whatever you already know about git!
GitOps Workflow
The Deployer commits the desired state (YAML) to the GitOps repo, then runs kubectl apply to push it to the cluster:

apiVersion: v1
kind: Pod
metadata:
  name: mysql
spec:
  containers:
  - name: mysql
    image: mysql:7.6
GitOps Workflow
● What about 3rd-party applications?
● Usually, infrastructure configuration is provided by the vendor
● For example, Kubeflow maintains a “manifests” monorepo with all deployment configurations
Kubeflow developers commit to the upstream manifests repo; your downstream GitOps repo periodically rebases on it, and the Deployer commits to it and runs kubectl apply from there.
GitOps - Managing Configuration
● How do you manage configuration?
○ Use 3rd-party provided configs
○ Apply customer-specific changes
○ Update periodically
● Several tools:
○ helm
○ kustomize
○ ...
● Kubeflow uses kustomize
● We (Arrikto) use kustomize for our deployments

kind: Deployment
metadata:
  name: redis
  namespace: deploy
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: redis
        image: gcr.io/redis:6
Managing Configuration - Helm
● Helm is the most popular tool that uses templating
● Exposes knobs to consumers via values file
● Templating is hard to read
The vendor repo (upstream) provides the Chart; the customer repo (downstream) provides the values.yaml.
{{ if (or (not .Values.persistence.enabled) (eq .Values.persistence.type "pvc")) }}
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ template "grafana.fullname" . }}
namespace: {{ template "grafana.namespace" . }}
labels:
{{- include "grafana.labels" . | nindent 4 }}
{{- if .Values.labels }}
{{ toYaml .Values.labels | indent 4 }}
{{- end }}
{{- with .Values.annotations }}
annotations:
{{ toYaml . | indent 4 }}
{{- end }}
https://github.com/helm/charts/blob/99805df25da220c379ad609fcb7cf20e5e0d4fc0/stable/grafana/templates/deployment.yaml
Managing Configuration - Templating
└── redis
├── base
│ ├── configmap.yaml
│ ├── kustomization.yaml
│ ├── service.yaml
│ └── statefulset.yaml
Managing Configuration - kustomize
resources:
- configmap.yaml
- service.yaml
- statefulset.yaml
kustomization.yaml
● Base configuration
Managing Configuration - kustomize
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: redis
        image: gcr.io/redis:6
kustomize build
redis/base
resources:
- configmap.yaml
- service.yaml
- statefulset.yaml
kustomization.yaml
└── redis
├── base
└── overlays
├── deploy
│ ├── kustomization.yaml
│ └── patches
│ └── replicas.yaml
Managing Configuration - kustomize
bases:
- ../base
namespace: deploy
patches:
- path: patches/replicas.yaml
kustomization.yaml
● Create overlays (variants) to customize
deployment
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 3
patches/replicas.yaml
Managing Configuration - kustomize
kind: Deployment
metadata:
  name: redis
  namespace: deploy
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: redis
        image: gcr.io/redis:6
kustomize build
redis/overlays/deploy
bases:
- ../base
namespace: deploy
patches:
- path: patches/replicas.yaml
kustomization.yaml
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 3
patches/replicas.yaml
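Conceptually, `kustomize build` merges the patch onto the base and applies the overlay's namespace. A toy Python sketch of that merge (real kustomize uses strategic merge patches and is far more capable):

```python
# Recursively merge a patch dict onto a base dict (patch values win).
def deep_merge(base: dict, patch: dict) -> dict:
    out = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

base = {"kind": "Deployment",
        "metadata": {"name": "redis"},
        "spec": {"replicas": 1,
                 "template": {"spec": {"containers": [
                     {"name": "redis", "image": "gcr.io/redis:6"}]}}}}
patch = {"spec": {"replicas": 3}}

built = deep_merge(base, patch)
built["metadata"]["namespace"] = "deploy"  # namespace set by the overlay
```

Note how the base's image and template survive untouched while the patched replicas value wins; that is the property that makes overlays rebase-friendly.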
Managing Configuration - kustomize
The vendor repo (upstream) carries versions v1, v2, v3; the customer repo (downstream) carries the same versions plus a downstream commit d1 with its customizations.
● Update with git rebase
● Separate file == no conflicts
└── redis
    ├── base
    └── overlays
        └── deploy
Managing Configuration - kustomize
● Powerful customization capabilities
● Rebase from upstream to get new updates
● Customizations in separate folders, no conflicts on rebase
The base comes from the upstream repo; consumer customizations live as overlays in the GitOps repo.
● Simplify Kubeflow stack installation, configuration, and management
○ Deploy and manage software in a declarative way
○ Complete visibility of system configuration
● Accelerate the upgrade process by continuously deploying changes to the
cluster
○ Track changes and revert if something goes wrong
● Collaborate better and faster, share knowledge with the whole team
○ Keep using your favorite familiar tools and workflow
Why GitOps in your Kubeflow Deployment
Demo
1. Kubernetes Cluster (EKS) on Amazon Web Services
2. Deploy Rok
3. Deploy Kubeflow
4. Update installation from upstream
Security in Kubeflow
“We observed that this attack affected tens of Kubernetes clusters.”
Multi-User Isolation
Authentication?
Authorization?
Authentication using the OIDC Protocol
● Open and standardized flow built on OAuth 2.0
● Objective: get the user’s identity (username, groups)
● Popular and secure
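The end result of the OIDC flow is an ID token: a JWT whose payload carries the user's identity as claims. A minimal sketch of reading those claims (the token below is fabricated for illustration, and a real deployment must also verify the signature):

```python
import base64
import json

# Decode the payload (second dot-separated part) of a JWT ID token.
# NOTE: this skips signature verification, which a real deployment must do.
def decode_claims(id_token: str) -> dict:
    payload = id_token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))

# Build a fake unsigned token for illustration ("e30" is base64 for "{}").
claims = {"sub": "user123", "email": "user@example.com", "groups": ["team-a"]}
payload_b64 = base64.urlsafe_b64encode(
    json.dumps(claims).encode()).decode().rstrip("=")
fake_token = ".".join(["e30", payload_b64, "sig"])

identity = decode_claims(fake_token)
```

The `sub` and `email` claims come from OIDC Core; a `groups` claim is a common extension that identity providers such as Dex can emit.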
Identity Provider
An OIDC provider interface sits in front of the identity provider, which can be LDAP / AD, a static password file, or an external IdP (Google, LinkedIn, …).
Authorization
● Authorization with Role Based Access Control (RBAC)
● Commit RBAC resources in git for reproducibility
Endpoint                                        RBAC Resource   Verb
GET /apis/kubeflow.org/v1/notebooks/{name}      Notebooks       GET
GET /apis/kubeflow.org/v1/notebooks             Notebooks       LIST
POST /apis/kubeflow.org/v1/notebooks            Notebooks       CREATE
DELETE /apis/kubeflow.org/v1/notebooks/{name}   Notebooks       DELETE
GET /apis/kubeflow.org/v1/experiments/{name}    Experiments     GET
Can USER do ACTION on RESOURCE in NAMESPACE?
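That question reduces to a lookup over roles and role bindings; a simplified model in Python (the role, user, and namespace names are made up):

```python
# Simplified model of Kubernetes RBAC: a role grants (resource, verb) pairs,
# and a binding attaches a role to a user within a namespace.
roles = {
    "notebook-editor": {("notebooks", "get"), ("notebooks", "list"),
                        ("notebooks", "create"), ("notebooks", "delete")},
}
bindings = [
    {"user": "stefano", "role": "notebook-editor", "namespace": "team-a"},
]

def can(user: str, verb: str, resource: str, namespace: str) -> bool:
    return any(b["user"] == user
               and b["namespace"] == namespace
               and (resource, verb) in roles[b["role"]]
               for b in bindings)

print(can("stefano", "create", "notebooks", "team-a"))  # True
print(can("stefano", "create", "notebooks", "team-b"))  # False
```

Because roles and bindings are plain Kubernetes resources, committing them to git versions the entire access-control policy along with the rest of the configuration.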
Handling Credentials
● Credentials are kept in Secrets
● Injected into Pods at runtime with PodDefaults
● Applications expect to find secrets in files or environment variables
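On the consumer side, an application simply reads the injected credential; a sketch of that lookup (the variable name and mount path are illustrative, not a Kubeflow convention):

```python
import os

# Look up a credential the way applications typically expect it:
# first as an environment variable, then as a mounted secret file.
def load_credential(name: str, mount_dir: str = "/var/run/secrets") -> str:
    if name in os.environ:                  # e.g. injected via a PodDefault
        return os.environ[name]
    with open(os.path.join(mount_dir, name)) as f:  # or mounted as a file
        return f.read().strip()
```

Keeping this lookup logic out of the training code itself means the same image runs unchanged across namespaces with different credentials.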
Auth Guidelines for Kubeflow
● Guidelines for secure applications in Kubeflow
https://github.com/kubeflow/community/blob/3357efef4947297026111df17e468d9204fa2061/guidelines/auth.md
CI/CD for ML
How can data scientists continually improve
and validate models?
● Develop models and pipelines in Jupyter
● Convert notebook to pipeline using Kale
● Run pipeline using Kubeflow Pipelines
● Explore and debug pipeline using Rok
The notebook-to-pipeline (N2P) critical user journey (CUJ): Develop (Jupyter) → Create Pipeline (Kale) → Run Pipeline (KF Pipelines) → Explore Pipeline (Rok).
This workshop will focus on two essential
aspects:
• Low barrier to entry: deploy a Jupyter
Notebook to Kubeflow Pipelines in the
Cloud using a fully GUI-based approach
• Reproducibility: automatic data
versioning to enable reproducibility and
better collaboration between data
scientists
Data Science with Kubeflow
Building a model is itself a pipeline: Data Ingestion → Data Analysis → Data Transformation → Data Validation → Data Splitting → Trainer → Model Validation → Training At Scale → Logging, followed by Roll-out → Serving → Monitoring.
Kubeflow Pipelines exists because Data Science and ML are inherently pipeline processes
Benefits of running a Notebook as a Pipeline
● The steps of the workflow are clearly defined
● Parallelization & isolation
○ Hyperparameter tuning
● Data versioning
● Different infrastructure requirements
○ Different hardware (GPU/CPU)
Workflow: Before
1. Write your ML code
2. Create Docker images
3. Write KFP DSL code
4. Compile the KFP DSL
5. Upload the pipeline to KFP
6. Run the pipeline
7. Need to amend your ML code? Start over from step 1.
Workflow: After
1. Write your ML code
2. Tag your Notebook cells
3. Run the pipeline at the click of a button
4. Need to amend your ML code? Just edit your Notebook!
A Data Scientist can now reduce the time taken to write ML code and run a pipeline by 70%. That means you can now run 3x as many experiments as you did before. What that really means is that you can deliver work faster to the business and drive more revenue.
Hyperparameter optimization
The two ways of life
● Change the parameters manually
● Use Katib
What is Katib
Katib is a Kubernetes-based system for Hyperparameter Tuning and
Neural Architecture Search. It supports a number of ML frameworks,
including TensorFlow, Apache MXNet, PyTorch, XGBoost, and others.
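What Katib automates at cluster scale can be sketched as a simple search loop; the search space and objective function below are invented for illustration:

```python
import random

random.seed(0)  # deterministic for the example

# Pretend validation accuracy that peaks around lr = 0.01; in Katib the
# objective comes from real training runs, not a formula.
def objective(lr: float, batch_size: int) -> float:
    return 1.0 - abs(lr - 0.01) - 0.001 * batch_size

best_score, best_params = float("-inf"), None
for _ in range(20):  # each iteration corresponds to one Trial
    params = {"lr": random.uniform(1e-4, 1e-1),
              "batch_size": random.choice([16, 32, 64])}
    score = objective(**params)
    if score > best_score:
        best_score, best_params = score, params
```

Katib runs such trials as Kubernetes workloads in parallel and supports smarter algorithms than random search (grid, Bayesian optimization, and more).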
Hyperparameter optimization
Combining the N2P CUJ with Katib
● Configure parameters, search algorithm, and objectives
using a GUI
● Start HP tuning with the click of a button
● Reproducibility of every pipeline and every step
● Run Katib Trials as Pipelines
● Complete visibility of every different Katib Trial
● Caching for faster computation
A data science journey
Agenda
● Explore Kubeflow components
● Explore the ML code of the dog breed identification example
● Convert the notebook to a Kubeflow pipeline
● Explore the accuracy of the various models
● Optimize a model with hyperparameter tuning
● Explore the results of HP tuning

Go to arrik.to/demowfhp to find the Codelab with the step-by-step instructions for this tutorial
KALE – Kubeflow Automated Pipelines Engine
● Python package + JupyterLab extension
● Convert a Jupyter Notebook to a KFP workflow
● No need for Kubeflow SDK
Annotated
Jupyter Notebook
Kale
Conversion Engine
Kale Modules
● Parse: derive pipeline structure
● Analyze: identify dependencies
● Marshal: inject data objects
● Generate: generate & deploy pipeline
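The Analyze step can be sketched as matching the variables a step consumes against those earlier steps produce; the step names and variables below are invented:

```python
# Toy model of Kale's analysis: each notebook step produces and consumes
# variables; a step depends on whichever steps produce what it consumes.
steps = {
    "load_data":  {"produces": {"df"}, "consumes": set()},
    "preprocess": {"produces": {"features"}, "consumes": {"df"}},
    "train":      {"produces": {"model"}, "consumes": {"features"}},
}

def dependencies(steps: dict) -> dict:
    return {name: {other for other, o in steps.items()
                   if o["produces"] & spec["consumes"]}
            for name, spec in steps.items()}

deps = dependencies(steps)
print(deps["train"])  # {'preprocess'}
```

Once the dependency graph is known, the Marshal step can serialize each produced object and inject it into the consuming pipeline step.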
Contribute
github.com/kubeflow-kale
A TFX-style stack maps onto these tools: TFDV, TFTransform, TFDV, Estimators, TFMA, TFServing, with Katib as the tuner and Arrikto Rok underneath for data versioning.
Arrikto Rok
Data Versioning, Packaging, and Sharing
Across teams and cloud boundaries for complete Reproducibility, Provenance, and Portability
Experimentation, Training, and Production environments each run on any storage, with data-aware PVCs provided by the Arrikto CSI driver.
Model Building without Data Management
Step 1: 1. Download data from the Data Lake, 2. Store it locally, 3. Do initial analysis, 4. Upload data to the Lake.
Step 2: 5. Download data from the Lake, 6. Store it locally, 7. Transform data, 8. Upload to the Lake.
Step 3: 9. Download data from the Lake, 10. Store it locally, 11. Train model, 12. Upload.
Model Building with Local Data Management (Rok)
Step 1: 1. Clone disk from snapshot, 2. Do initial analysis, 3. Snapshot.
Step 2: 4. Clone disk of Step 1, 5. Transform data, 6. Snapshot.
Step 3: 7. Clone disk of Step 2, 8. Train model, 9. Snapshot.
Rok stores every snapshot in the Object Store.
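The clone/snapshot pattern can be sketched with plain dicts; this toy model only illustrates the semantics, not Rok itself:

```python
# Toy model of snapshot/clone: a snapshot freezes a volume's contents under a
# name; a clone starts a new, independent volume from a snapshot.
snapshots = {}

def snapshot(name: str, volume: dict) -> None:
    snapshots[name] = dict(volume)   # immutable copy of current contents

def clone(name: str) -> dict:
    return dict(snapshots[name])     # new mutable volume seeded from snapshot

step1 = {"data.csv": "raw"}
snapshot("after-step-1", step1)

step2 = clone("after-step-1")        # Step 2 clones the disk of Step 1
step2["data.csv"] = "transformed"    # and works without touching Step 1's data
snapshot("after-step-2", step2)
```

Because every step starts from a named snapshot, any step (or the whole pipeline) can later be reproduced exactly from the object store.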
Sync State & Data
Location 1 runs Pipeline 1 (Steps 1, 2, 3), with each step snapshotted to the Arrikto object store. By syncing state and data through the object store, Location 2 can start Pipeline 2 right after Step 3 of Pipeline 1 (Steps 4, 5, 6), or reproduce Pipeline 1 entirely as Pipeline 3.
Arrikto Rok snapshots every artifact along the pipeline: Cloned Data → Validation → Validated Data → Preprocessing → Preprocessed Data → Training → Trained Model → Evaluation → Evaluated Model → Deployment → Deployed Model. A failed validation stops the pipeline.
What have we achieved in this tutorial?
● Streamline your ML workflows using intuitive UIs
● Exploit the caching feature to give a boost to your pipeline runs
● Run a pipeline-based hyperparameter tuning workflow starting from your
Jupyter Notebook
● Use Kale as a workflow tool to orchestrate Katib and Kubeflow Pipelines
experiments
● Simplify the deployment and management of Kubeflow using GitOps
● Accelerate the time to production
● Collaborate faster and more easily in a secure and isolated manner
Summary
Just a small sample of
community contributions
● Jupyter manager UI
● Pipelines volume support
● MiniKF
● Auth with Istio + Dex
● On-premise installation
● Linux Kernel
Community
Kubeflow is open
● Open community
● Open design
● Open source
● Open to ideas
Get involved
● github.com/kubeflow
● kubeflow.slack.com
● @kubeflow
● kubeflow-discuss@googlegroups.com
● Community call on Tuesdays
Thank You!
More Info
arrik.to/odsc20
Email Address:
stefano@arrikto.com
yanniszark@arrikto.com
company/arrikto

Yannis Zarkadas. Enterprise data science workflows on kubeflow

  • 1.
    Enterprise Data ScienceWorkflows on Kubeflow Use GitOps to deploy and manage your Kubeflow cluster. Perform an end-to-end data science workflow on Kubeflow. Stefano Fioravanzo Yannis Zarkadas Arrikto
  • 2.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 GitOps and Multi-Tenancy Combined for an Enterprise Data Science Experience on Kubeflow Stefano Fioravanzo Yannis Zarkadas Software Engineer Software Engineer 2
  • 3.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 ● How to deploy and manage Kubeflow in a GitOps manner ● How to make sure you run Kubeflow in a secure way ● How to optimize and build production-ready models faster Why is this important? ✓ Simplify deployment and management of Kubeflow ✓ Accelerate time to production ✓ Collaborate in a secure and isolated manner What You’ll Learn In This Session 3
  • 4.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 What is Kubeflow The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes: simple, portable and scalable. 4
  • 5.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Perception: ML Products are mostly about ML Credit: Hidden Technical Debt of Machine Learning Systems, D. Sculley, et al. Configuration Data Collection Data Verification Feature Extraction Process Management Tools Analysis Tools Machine Resource Management Serving Infrastructure Monitoring ML Code 5
  • 6.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Reality: ML Requires DevOps; lots of it Configuration Data Collection Data Verification Feature Extraction Process Management Tools Analysis Tools Machine Resource Management Serving Infrastructure Monitoring ML Code Credit: Hidden Technical Debt of Machine Learning Systems, D. Sculley, et al. 6
  • 7.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Kubeflow components 7 Jupyter Notebooks Workflow Building Pipelines Tools Serving Metadata Data Management Kale Fairing TFX Airflow, + KF Pipelines HP Tuning Tensorboard KFServing Seldon Core TFServing, + Training Operators Pytorch XGBoost, + Tensorflow Prometheus Versioning ReproducibilitySecure Sharing
  • 8.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Platforms / clouds GCP AWS IBM CloudAzure OpenShift Istio ML tools PyTorch scikit-learn Jupyter TensorFlow PyTorch Serving TensorFlow Serving XGBoost Kubernetes Argo Prometheus Spartakus Seldon Core Kubeflow applications and scaffolding Chainer MPI MXNet On prem Jupyter notebook web app and controller Hyperparameter tuning (Katib) Kale Pipelines Metadata Training operators: MPI, MXNet, PyTorch, TFJob, XGBoost Kubeflow UI KFServing 8
  • 9.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Platforms / clouds Kubeflow applications and scaffolding ML tools PyTorch scikit-learn Jupyter TensorFlow XGBoost Chainer MPI MXNet GCP AWS IBM CloudAzure OpenShift Istio PyTorch Serving TensorFlow Serving Kubernetes Argo Prometheus Spartakus Seldon Core On prem Jupyter notebook web app and controller Hyperparameter tuning (Katib) Kale Pipelines Metadata Kubeflow UI KFServing Training operators: MPI, MXNet, PyTorch, TFJob, XGBoost 9
  • 10.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Platforms / clouds ML tools PyTorch scikit-learn Jupyter TensorFlow XGBoost Kubeflow applications and scaffolding Chainer MPI MXNet GCP AWS IBM CloudAzure OpenShift Istio PyTorch Serving TensorFlow Serving Kubernetes Argo Prometheus Spartakus Seldon Core On prem Jupyter notebook web app and controller Hyperparameter tuning (Katib) Kale Pipelines Metadata Kubeflow UI KFServing Training operators: MPI, MXNet, PyTorch, TFJob, XGBoost 10
  • 11.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Platforms / clouds ML tools PyTorch scikit-learn Jupyter TensorFlow XGBoost Kubeflow applications and scaffolding Chainer MPI MXNet GCP AWS IBM CloudAzure OpenShift Istio PyTorch Serving TensorFlow Serving Kubernetes Argo Prometheus Spartakus Seldon Core Jupyter notebook web app and controller Hyperparameter tuning (Katib) Kale Pipelines Metadata Kubeflow UI KFServing On prem Training operators: MPI, MXNet, PyTorch, TFJob, XGBoost 11
  • 12.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 ML workflow Identify problem and collect and analyse data Choose an ML algorithm and code your model Experiment with data and model training Tune the model hyperparamet ers Jupyter Notebook Katib TensorFlow scikit-learn PyTorch XGBoost Jupyter Notebook Kale Pipelines KFServing PyTorch TFServing Seldon Core NVIDIA TensorRT Serve the model for online/batch prediction 12
  • 13.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Testimonials ● Dyson: “Kubeflow is to data science what a lab notebook is to biomedical scientists — a way to expedite ideas from the lab to the ‘bedside’ 3x faster, while ensuring experimental reproducibility.” ● US Bank: “The Kubeflow 1.0 release is a significant milestone as it positions Kubeflow to be a viable ML Enterprise platform. Kubeflow 1.0 delivers material productivity enhancements for ML researchers.” ● One Technologies: “With Kubeflow at the heart of our ML platform, our small company has been able to stack models in production to improve CR, find new customers, and present the right product to the right customer at the right time.” 13
  • 14.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Testimonials ● GroupBy: “Kubeflow is helping GroupBy in standardizing ML workflows and simplifying very complicated deployments!” ● Volvo Cars: “Kubeflow provides a seamless interface to a great set of tools that together manages the complexity of ML workflows and encourages best practices. The Data Science and Machine Learning teams at Volvo Cars are able to iterate and deliver reproducible, production grade services with ease.” 14
  • 15.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Kubeflow - The Infra Side ● Install ● Manage ● Secure ● Upgrade
  • 16.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 What is GitOps 16 All configuration state is declaratively stored in git.
  • 17.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Imperative vs Declarative Imperative 1. Create Service 2. Update LoadBalancer 3. Upgrade Deployment
  • 18.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Imperative vs Declarative Declarative Desired State (YAML) K8s kind: Pod metadata: name: mysql spec: image: mysql:7.6 apply etcd
  • 19.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Controller Spec (desired) Status (real) Kubernetes Objects Controller Pattern - The driver behind declarative APIs Used everywhere in Kubernetes Observe Calculate Reconcile Physical ResourcesPhysical ResourcesPhysical Resources write
  • 20.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Why GitOps?
  • 21.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 K8s etcd K8s etcd Reproducibility commit 856df4gdf56g4561d1fg564df5g61v6854df Author: yanniszark <yanniszark@arrikto.com> Date: Tuesday, Sep 8 11:24:12 2020 +0200 Upgrade MySQL to new version. K8s etcd apply ● Whole configuration state in git, versioned by commits ● Careful! Mutable state still outside of git (e.g., volumes, S3) ○ Need versioning solution for end-to-end reproducibility ○ Arrikto Rok produces data commits for your volumes (e.g., MySQL)
  • 22.
    Simplify. Accelerate. Collaborate.arrik.to/odsc20 Rollbacks commit 856df4gdf56g4561d1fg564df5g61v6854df Author: yanniszark <yanniszark@arrikto.com> Date: Tuesday, Sep 8 11:24:12 2020 +0200 Upgrade MySQL to new version. commit er1f1ef8f1e1rf5641sdfs564d1fsd1f5sd61fgwd Author: yanniszark <yanniszark@arrikto.com> Date: Tuesday, Sep 4 15:24:12 2020 +0200 Increase MySQL read-replicas to 3 for higher availability. git log K8s etcd apply apply Unhealthy
Auditing
git blame:
48f078b0 (Yannis Zarkadas    2020-06-11 41) kind: Deployment
48f078b0 (Yannis Zarkadas    2020-06-11 42) metadata:
48f078b0 (Yannis Zarkadas    2020-06-11 43)   name: nginx
48f078b0 (Yannis Zarkadas    2020-06-11 46) spec:
48f078b0 (Stefano Fioravanzo 2020-06-11 47)   replicas: 1
23
Rich Ecosystem
● Collaboration through familiar and battle-tested tools
○ Pull Requests and Code Reviews
● Rich offerings
○ GitHub, GitLab, etc.
● Plenty of integrations
○ GitHub Actions, GitLab Pipelines, etc.
Reuse whatever you already know about git!
24
GitOps Workflow
25
GitOps Workflow
Commit the desired state (YAML) to the GitOps repo; the deployer applies it with kubectl apply:
kind: Pod
metadata:
  name: mysql
spec:
  image: mysql:7.6
26
GitOps Workflow
● What about 3rd-party applications?
● Usually, infrastructure configuration is provided by the vendor
● For example, Kubeflow maintains a “manifests” monorepo with all deployment configurations
Kubeflow developers commit to the upstream manifests repo; you periodically rebase your downstream GitOps repo on it, and the deployer applies the result with kubectl apply.
27
GitOps - Managing Configuration
● How do you manage configuration?
○ Use 3rd-party provided configs
○ Apply customer changes
○ Update periodically
● Several tools:
○ helm
○ kustomize
○ ...
● Kubeflow uses kustomize
● We (Arrikto) use kustomize for our deployments
kind: Deployment
metadata:
  name: redis
  namespace: deploy
spec:
  template:
    spec:
      image: gcr.io/redis:6
      replicas: 3
28
Managing Configuration - Helm
● Helm is the most popular tool that uses templating
● Exposes knobs to consumers via a values file
● Templating is hard to read
The vendor repo (upstream) provides the Chart; the customer repo (downstream) provides values.yaml.
29
Managing Configuration - Templating
{{ if (or (not .Values.persistence.enabled) (eq .Values.persistence.type "pvc")) }}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ template "grafana.fullname" . }}
  namespace: {{ template "grafana.namespace" . }}
  labels:
    {{- include "grafana.labels" . | nindent 4 }}
    {{- if .Values.labels }}
{{ toYaml .Values.labels | indent 4 }}
    {{- end }}
  {{- with .Values.annotations }}
  annotations:
{{ toYaml . | indent 4 }}
  {{- end }}
https://github.com/helm/charts/blob/99805df25da220c379ad609fcb7cf20e5e0d4fc0/stable/grafana/templates/deployment.yaml
30
Managing Configuration - kustomize
● Base configuration
└── redis
    ├── base
    │   ├── configmap.yaml
    │   ├── kustomization.yaml
    │   ├── service.yaml
    │   └── statefulset.yaml
kustomization.yaml:
resources:
- configmap.yaml
- service.yaml
- statefulset.yaml
31
Managing Configuration - kustomize
kustomize build redis/base
kustomization.yaml:
resources:
- configmap.yaml
- service.yaml
- statefulset.yaml
Output:
kind: Deployment
metadata:
  name: redis
spec:
  template:
    spec:
      image: gcr.io/redis:6
      replicas: 1
32
Managing Configuration - kustomize
● Create overlays (variants) to customize the deployment
└── redis
    ├── base
    └── overlays
        └── deploy
            ├── kustomization.yaml
            └── patches
                └── replicas.yaml
kustomization.yaml:
bases:
- ../base
namespace: deploy
patches:
- path: patches/replicas.yaml
patches/replicas.yaml:
kind: Deployment
metadata:
  name: redis
spec:
  template:
    spec:
      replicas: 3
33
Managing Configuration - kustomize
kustomize build redis/overlays/deploy
The overlay kustomization.yaml (bases: ../base, namespace: deploy, patches: patches/replicas.yaml) produces:
kind: Deployment
metadata:
  name: redis
  namespace: deploy
spec:
  template:
    spec:
      image: gcr.io/redis:6
      replicas: 3
34
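Conceptually, the overlay build above is a recursive merge of the patch onto the base manifest. The sketch below illustrates that idea in plain Python; it is a simplified stand-in for kustomize's patching, not the real implementation.

```python
# Simplified sketch of what `kustomize build redis/overlays/deploy` computes:
# recursively merge the overlay patch (and the namespace setting) onto the base.
def merge(base: dict, patch: dict) -> dict:
    """Recursively merge patch onto base; patch values win on conflict."""
    out = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out


base = {
    "kind": "Deployment",
    "metadata": {"name": "redis"},
    "spec": {"template": {"spec": {"image": "gcr.io/redis:6", "replicas": 1}}},
}
patch = {"spec": {"template": {"spec": {"replicas": 3}}}}       # patches/replicas.yaml
namespace = {"metadata": {"namespace": "deploy"}}               # namespace: deploy

result = merge(merge(base, patch), namespace)
```

Because the patch only names the fields it changes, the untouched fields (the image, the name) survive the merge, which is why downstream customizations rebase cleanly onto upstream updates.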
Managing Configuration - kustomize
(Diagrams: the vendor repo (upstream) advances from v1 to v2 to v3, while the customer repo (downstream) carries its customization commit d1 on top of the upstream history.)
● Update with git rebase
● Separate file == no conflicts
35-37
Managing Configuration - kustomize
└── redis
    ├── base
    └── overlays
        └── deploy
● Powerful customization capabilities
● Rebase from upstream to get new updates
● Customizations in separate folders, no conflicts on rebase
The upstream repo provides the base; consumer customizations live in the GitOps repo's overlays.
38
Why GitOps in your Kubeflow Deployment
● Simplify Kubeflow stack installation, configuration, and management
○ Deploy and manage software in a declarative way
○ Complete visibility of system configuration
● Accelerate the upgrade process by continuously deploying changes to the cluster
○ Track changes and revert if something goes wrong
● Collaborate better and faster, share knowledge with the whole team
○ Keep using your favorite familiar tools and workflows
39
Demo
1. Kubernetes Cluster (EKS) on Amazon Web Services
2. Deploy Rok
3. Deploy Kubeflow
4. Update installation from upstream
40
Security in Kubeflow
“We observed that this attack effected on tens of Kubernetes clusters.”
41
Multi-User Isolation
Authentication? Authorization?
42
Authentication using the OIDC Protocol
● Open and standardized flow built on OAuth 2.0
● Objective: get the user’s identity (username, groups)
● Popular and secure
43
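Concretely, "getting the user's identity" means reading claims out of the OIDC ID token, which is a JWT: three base64url-encoded parts (header.payload.signature). The sketch below only decodes the payload of a toy token to show the claim structure; a real deployment must verify the token's signature against the provider's keys (e.g., with a JWT library) before trusting any claim.

```python
# Hedged sketch: extract identity claims (email, groups) from an OIDC ID token.
# WARNING: this performs NO signature verification; it is for illustration only.
import base64
import json


def claims_from_id_token(token: str) -> dict:
    """Decode the middle (payload) segment of a JWT into its claims dict."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)   # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


# Build a toy, unsigned token just to exercise the decoder.
body = {"email": "user@example.com", "groups": ["data-science"]}
encoded = base64.urlsafe_b64encode(json.dumps(body).encode()).decode().rstrip("=")
token = f"header.{encoded}.signature"

claims = claims_from_id_token(token)
```

The identity (here `email` and `groups`) is what the authorization layer then feeds into its RBAC checks.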
Identity Provider
The OIDC provider interface can back onto:
● LDAP / AD
● Static Password File
● External IdP (Google, LinkedIn, …)
44
Authorization
● Authorization with Role Based Access Control (RBAC)
● Commit RBAC resources in git for reproducibility
Can USER do ACTION on RESOURCE in NAMESPACE?

Endpoint                                        | RBAC Resource | Verb
GET    /apis/kubeflow.org/v1/notebooks/{name}   | Notebooks     | GET
GET    /apis/kubeflow.org/v1/notebooks          | Notebooks     | LIST
POST   /apis/kubeflow.org/v1/notebooks          | Notebooks     | CREATE
DELETE /apis/kubeflow.org/v1/notebooks/{name}   | Notebooks     | DELETE
GET    /apis/kubeflow.org/v1/experiments/{name} | Experiments   | GET
45
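The question "can USER do ACTION on RESOURCE in NAMESPACE?" can be sketched as a lookup over roles and role bindings. The shapes below loosely mirror Kubernetes RBAC (Roles grant verbs on resources; RoleBindings grant roles to users in a namespace), but the role names and users are illustrative.

```python
# Simplified RBAC check: Roles grant verbs on resources;
# RoleBindings attach roles to (user, namespace) pairs.
roles = {
    "notebook-editor": {
        "resources": {"notebooks"},
        "verbs": {"get", "list", "create", "delete"},
    },
    "experiment-viewer": {
        "resources": {"experiments"},
        "verbs": {"get", "list"},
    },
}

# RoleBindings: (user, namespace) -> set of roles granted there
bindings = {
    ("alice", "team-a"): {"notebook-editor"},
    ("bob", "team-a"): {"experiment-viewer"},
}


def allowed(user: str, verb: str, resource: str, namespace: str) -> bool:
    """Can USER do VERB on RESOURCE in NAMESPACE?"""
    for role_name in bindings.get((user, namespace), set()):
        role = roles[role_name]
        if resource in role["resources"] and verb in role["verbs"]:
            return True
    return False
```

Because roles and bindings are plain declarative objects, committing them to git (as the slide suggests) makes the entire authorization policy reproducible and auditable.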
Handling Credentials
● Credentials are kept in Secrets
● Injected into Pods at runtime with PodDefaults
● Applications expect to find secrets in files or environment variables
46
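From the application's point of view, an injected credential simply shows up as an environment variable or a mounted file. The sketch below shows a typical lookup order; the variable name and mount path are hypothetical examples, not fixed Kubeflow conventions (the actual names come from your Secret and PodDefault definitions).

```python
# Sketch: how an application consumes a credential injected into its Pod.
# Try an environment variable first, then fall back to a mounted file.
import os
from typing import Optional


def load_credential(env_var: str, file_path: str) -> Optional[str]:
    value = os.environ.get(env_var)
    if value:
        return value
    try:
        with open(file_path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None


# Simulate the env var a PodDefault might inject (illustrative name and path).
os.environ["S3_ACCESS_KEY"] = "example-key"
cred = load_credential("S3_ACCESS_KEY", "/var/run/secrets/s3/access-key")
```

Keeping the lookup outside the application logic means the same code runs unchanged whether the cluster injects the secret as an env var or as a file.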
Auth Guidelines for Kubeflow
● Guidelines for secure applications in Kubeflow:
https://github.com/kubeflow/community/blob/3357efef4947297026111df17e468d9204fa2061/guidelines/auth.md
47
CI/CD for ML
How can data scientists continually improve and validate models? The Notebook-to-Pipeline (N2P) Critical User Journey (CUJ):
● Develop models and pipelines in Jupyter
● Convert the notebook to a pipeline using Kale
● Run the pipeline using Kubeflow Pipelines
● Explore and debug the pipeline using Rok
48
Data Science with Kubeflow
This workshop will focus on two essential aspects:
● Low barrier to entry: deploy a Jupyter Notebook to Kubeflow Pipelines in the Cloud using a fully GUI-based approach
● Reproducibility: automatic data versioning to enable reproducibility and better collaboration between data scientists
Kubeflow Pipelines exists because Data Science and ML are inherently pipeline processes:
Building a Model (Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Trainer, Model Validation, Training At Scale), then Roll-out, Serving, Monitoring, and Logging.
49
Benefits of running a Notebook as a Pipeline
● The steps of the workflow are clearly defined
● Parallelization & isolation
○ Hyperparameter tuning
● Data versioning
● Different infrastructure requirements
○ Different hardware (GPU/CPU)
51
Workflow
Before:
Write your ML code → Create Docker images → Write KFP DSL code → Compile the KFP DSL → Upload pipeline to KFP → Run the Pipeline → Amend your ML code?
After:
Write your ML code → Tag your Notebook cells → Run the Pipeline at the click of a button → Amend your ML code? Just edit your Notebook!
A Data Scientist can now reduce the time taken to write ML code and run a pipeline by 70%. That means you can now run 3x as many experiments as you did before. What that really means is that you can deliver work faster to the business and drive more revenue.
54
Hyperparameter optimization
The two ways of life:
● Change the parameters manually
● Use Katib
55
What is Katib
Katib is a Kubernetes-based system for Hyperparameter Tuning and Neural Architecture Search. It supports a number of ML frameworks, including TensorFlow, Apache MXNet, PyTorch, XGBoost, and others.
56
Hyperparameter optimization
Combining the N2P CUJ with Katib:
● Configure parameters, search algorithm, and objectives using a GUI
● Start HP tuning with the click of a button
● Reproducibility of every pipeline and every step
● Run Katib Trials as Pipelines
● Complete visibility of every different Katib Trial
● Caching for faster computation
57
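The core loop that a system like Katib automates can be sketched as: sample hyperparameters from a declared search space, run one trial per sample, and keep the best objective value. The objective function below is a cheap stand-in for a real training run, and the parameter names are illustrative.

```python
# Hedged sketch of hyperparameter tuning by random search.
# A real Katib Experiment would launch each trial as a separate (containerized)
# training job; here the "trial" is just a function call.
import random

random.seed(0)

search_space = {
    "learning_rate": (1e-4, 1e-1),    # continuous range
    "batch_size": [16, 32, 64, 128],  # discrete choices
}


def objective(params: dict) -> float:
    # Stand-in for validation accuracy; a real trial trains and evaluates a model.
    return (1.0
            - abs(params["learning_rate"] - 0.01)
            - 0.001 * abs(params["batch_size"] - 64))


def random_search(trials: int):
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        params = {
            "learning_rate": random.uniform(*search_space["learning_rate"]),
            "batch_size": random.choice(search_space["batch_size"]),
        }
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(50)
```

Because each trial is independent, they parallelize naturally, which is exactly why running Katib Trials as isolated pipeline steps (as above) pays off.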
A data science journey
58
Agenda
1. Convert notebook to a Kubeflow pipeline
2. Explore Kubeflow components
3. Explore the ML code of the dog breed identification example
4. Explore the accuracy of the various models
5. Optimize a model with hyperparameter tuning
6. Explore the results of HP tuning
Go to arrik.to/demowfhp to find the Codelab with the step-by-step instructions for this tutorial.
59
KALE – Kubeflow Automated Pipelines Engine
● Python package + JupyterLab extension
● Converts an annotated Jupyter Notebook to a KFP workflow
● No need for the Kubeflow SDK
Annotated Jupyter Notebook → Kale Conversion Engine
63
Kale Modules
● Parse: derive pipeline structure
● Analyze: identify dependencies
● Marshal: inject data objects
● Generate: generate & deploy pipeline
64
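The Parse and Analyze stages above boil down to a classic idea: tagged notebook cells become pipeline steps, and their declared dependencies define the execution order via a topological sort. The sketch below illustrates that idea with hypothetical step names; it is not Kale's actual implementation.

```python
# Illustrative sketch of Kale's Parse/Analyze idea: cells tagged as steps,
# plus their dependencies, yield a valid pipeline execution order.
from graphlib import TopologicalSorter  # Python 3.9+

# step -> set of steps it depends on (as would be declared by cell tags)
steps = {
    "load_data": set(),
    "preprocess": {"load_data"},
    "train": {"preprocess"},
    "evaluate": {"train"},
}

order = list(TopologicalSorter(steps).static_order())
```

The Generate stage then emits one KFP pipeline step per node, wired together in this order, which is how a linear notebook becomes an explicit DAG.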
Contribute
github.com/kubeflow-kale
65
TFDV, TFTransform, TFDV, Estimators, TFMA, TFServing, Katib Tuner, Arrikto Rok
66
Arrikto Rok
Data Versioning, Packaging, and Sharing across teams and cloud boundaries for complete Reproducibility, Provenance, and Portability.
(Diagram: Experimentation, Training, and Production each run on any storage, with data-aware PVCs backed by the Arrikto CSI driver.)
67
Model Building without Data Management
Step 1:
1. Download data from Lake
2. Store it locally
3. Do initial analysis
4. Upload data to Lake
Step 2:
5. Download data from Lake
6. Store it locally
7. Transform data
8. Upload to Lake
Step 3:
9. Download data from Lake
10. Store it locally
11. Train model
12. Upload Model
71
Model Building with Local Data Management (Rok)
Step 1:
1. Clone disk from snapshot
2. Do initial analysis
3. Snapshot
Step 2:
4. Clone disk of Step 1
5. Transform data
6. Snapshot
Step 3:
7. Clone disk of Step 2
8. Train model
9. Snapshot
Rok backs the snapshots with an Object Store.
72
Sync State & Data
(Diagram: Pipeline 1 runs Steps 1-3 at Location 1, snapshotting to the Arrikto Object Store. Pipeline 2 at Location 2 starts after Step 3 of Pipeline 1 and continues with Steps 4-6; Pipeline 3 reproduces Pipeline 1.)
73
(Diagram: a pipeline of Validation → Preprocessing → Training → Evaluation → Deployment, producing Validated Data, Preprocessed Data, Trained Model, Evaluated Model, and Deployed Model; Arrikto Rok versions the data at each stage, validation can Fail, and training runs on Cloned Data.)
74
Summary
What have we achieved in this tutorial?
● Streamline your ML workflows using intuitive UIs
● Exploit the caching feature to give a boost to your pipeline runs
● Run a pipeline-based hyperparameter tuning workflow starting from your Jupyter Notebook
● Use Kale as a workflow tool to orchestrate Katib and Kubeflow Pipelines experiments
● Simplify the deployment and management of Kubeflow using GitOps
● Accelerate the time to production
● Collaborate faster and more easily in a secure and isolated manner
78
Just a small sample of community contributions:
● Jupyter manager UI
● Pipelines volume support
● MiniKF
● Auth with Istio + Dex
● On-premise installation
● Linux Kernel
79
Community
Kubeflow is open:
● Open community
● Open design
● Open source
● Open to ideas
Get involved:
● github.com/kubeflow
● kubeflow.slack.com
● @kubeflow
● kubeflow-discuss@googlegroups.com
● Community call on Tuesdays
80
Thank You!
More Info: arrik.to/odsc20
Email Address: stefano@arrikto.com, yanniszark@arrikto.com
company/arrikto