Kostiantyn Bokhan, N-iX. CD4ML based on Azure and Kubeflow

Kostiantyn Bokhan, PhD
CD4ML based on Azure and Kubeflow

Agenda
1. Introduction to CD4ML
2. Kubeflow
3. Use cases of Kubeflow
4. Installing Kubeflow on Azure - tips and tricks

Introduction to CD4ML
Continuous Delivery for Machine Learning (CD4ML) is a software engineering
approach in which a cross-functional team produces machine learning
applications based on code, data, and models in small and safe increments
that can be reproduced and reliably released at any time, in short adaptation
cycles.
Source: Danilo Sato, Arif Wider, Christoph Windheuser, "Continuous Delivery for Machine Learning", https://martinfowler.com/articles/cd4ml.html

MLOps: Continuous delivery and automation pipelines in machine learning
MLOps is an ML engineering culture and practice that aims at unifying ML
system development (Dev) and ML system operation (Ops).
Source: Google, https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

MDLC: Model Development Life Cycle
DLC: Data Life Cycle

[Diagram: the CD4ML process, based on https://martinfowler.com/articles/cd4ml.html]
Stages: Data preparation, Model Building, Model Evaluation and Experimentation, Productionize Model, Testing, Deployment, Monitoring and Observability
Code: labeling code, training code, evaluating code, test code, application code
Model: candidate models, chosen models, productionized models
Data: raw data, training data, test data, validation / test data, production data
Other labels: metrics; code and model in production

https://martinfowler.com/articles/cd4ml.html

Continuous ML
CD4ML
Incremental (continual) ML
Auto ML

● Pachyderm: an end-to-end model versioning framework that helps create reproducible pipeline definitions, with each processing step packaged in a Docker container.
● Amazon SageMaker: a fully managed machine learning service. Developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment.
● MLflow: an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components.
● Kubeflow: a project dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.
● AzureML: empowers developers with a wide range of productive experiences for building, training, and deploying machine learning models faster. Accelerate time to market and foster team collaboration with industry-leading MLOps (DevOps for machine learning). Innovate on a secure, trusted platform, designed for responsible ML.
● Other: Lightbend, Streamlio’s Community Edition, Polyaxon, MFlow, Dataiku, Domino Data Science Platform, ParallelM MCenter, Seldon, MLeap

Kubeflow
● Kubeflow Pipelines: a platform for building and deploying portable and scalable end-to-end ML workflows, based on containers.
● Katib: automated tuning of a machine learning (ML) model’s hyperparameters and architecture, as well as AutoML in general.
● The Jupyter Notebook: an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
● Fairing: Kubeflow Fairing streamlines the process of building, training, and deploying machine learning (ML) training jobs in a hybrid cloud environment.
● Kale: a Python package that aims to automatically deploy a general-purpose Jupyter Notebook as a running Kubeflow Pipelines instance, without requiring the use of the specific KFP DSL.
● Metadata: the goal of the Metadata project is to track and manage metadata of machine learning workflows in Kubeflow.

Kubeflow
● TFJob: a Kubernetes custom resource that you can use to run TensorFlow training jobs on Kubernetes, including distributed jobs.
● Seldon: an open source platform for deploying machine learning models on a Kubernetes cluster.
● PyTorch jobs: you can create and manage PyTorch jobs like other built-in resources in Kubernetes.
● The NVIDIA TensorRT Inference Server: provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
● TensorFlow Serving: a flexible, high-performance serving system for machine learning models, designed for production environments.

Kubeflow
https://www.kubeflow.org/docs/started/kubeflow-overview/

Use cases - the project background
AI for a Worldwide Logistic Platform

Tasks: object detection, OCR, language modeling, NLP, anomaly detection, document matching, template matching, pattern recognition, segmentation, classification
Clients: Mobile Apps, IoT Apps, SaaS

Goals of implementing CD4ML
● Integrated infrastructure for AI experiments based on the Jupyter Notebook service
● Automation of all stages of deep machine learning development at scale:
○ Preprocessing, dataset preparation, augmentation
○ Model training and verification
○ Leveraging AutoML:
■ Neural architecture search based on AutoKeras
■ Training several models simultaneously
○ Optimization of model hyperparameters, which is the most frequent task
● Tracking and analysis of the obtained results, model versioning and metadata tracking
● Model as a service and continuous model delivery

Use cases
Experiments with Jupyter Notebook
[Diagram labels: Dev, Staging, Prod, Mobile, Labeling]

Use cases
Model building, training and validation

Use cases
CI / CD based on Kubeflow
[Diagram labels: Dev, Staging, Prod, Mobile, Labeling]

Use cases
CI / CD based on Kubeflow for embedded devices (Jetson Nano)
[Diagram labels: Dev, Staging, Prod, Labeling]

Use cases
CI / CD based on Kubeflow for mobile (Android & iOS)
[Diagram labels: Dev, Staging, Labeling]

Use cases
Mobile inference testing

Use cases
Debug data analytics

Use cases
Labeling quality and contracts testing

Deploying Kubeflow on Azure - issues
Istio
● Istio is outdated
● istioctl is not supported
KFServing
● KFServing is outdated
● TensorFlow 2 is not supported
● It is impossible to override the TensorFlow version
Knative
● Knative is outdated
● The embedded Knative does not support recent versions of Istio
Uninstalling Kubeflow
● The Istio deployment can be deleted
● Kubeflow can’t be uninstalled properly

Installing Kubeflow on Azure - tips and tricks
Deployment stages
1. Creating AKS 1.16
2. Creating & linking ACR
3. Installing Istio 1.5
4. Deploying Knative 0.18
5. Installing Kubeflow 1.1.0
6. Deploying kfserving 0.4.0
7. Deploying other components

Creating AKS
• Kubeflow is not fully tested with Kubernetes versions > 1.16
• nodepool-name examples:
npdevcpu: only for CPU tasks
nodeSelector."agentpool"=npdevcpu
npdevstorage: only for storage services, e.g. Rook, Minio etc.
nodeSelector."agentpool"=npdevstorage
az aks create --resource-group aigroup \
    --name aicluster \
    --node-count 3 \
    --vm-set-type VirtualMachineScaleSets \
    --nodepool-name npdevcpu \
    --load-balancer-sku standard \
    --kubernetes-version 1.16.15 \
    --node-vm-size Standard_DS3_v2 \
    --generate-ssh-keys \
    --service-principal "XXXXX" \
    --client-secret "XXXXX"
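To make the agentpool node selector above concrete, here is a minimal Python sketch that builds a Pod manifest pinned to one node pool. The helper pod_for_pool, the pod name and the image are hypothetical; only the agentpool label key and the pool names come from the slide. kubectl also accepts JSON manifests, so the printed output could be piped to kubectl apply -f -.

```python
import json


def pod_for_pool(name: str, image: str, pool: str) -> dict:
    """Build a Pod manifest that is scheduled only on nodes of the given node pool."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Same selector as on the slide: nodeSelector."agentpool"=<pool name>
            "nodeSelector": {"agentpool": pool},
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    # "cpu-task" and "busybox" are placeholder values for illustration only.
    print(json.dumps(pod_for_pool("cpu-task", "busybox", "npdevcpu"), indent=2))
```

The same helper would cover the storage and GPU pools by passing npdevstorage or npdevgpu as the pool name.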

Adding a GPU node pool and installing NVIDIA drivers
npdevgpu: only for GPU tasks
nodeSelector."agentpool"=npdevgpu
nvidia-device-plugin-ds.yaml can be found in
the Azure AKS documentation
> az aks nodepool add \
    --cluster-name aicluster \
    --name npdevgpu \
    --resource-group aigroup \
    --node-count 3 \
    --node-vm-size Standard_NC6
> kubectl create namespace gpu-resources
> kubectl apply -f nvidia-device-plugin-ds.yaml

Creating an ACR and linking with the AKS
Note: if you are not a subscription owner, you
can’t link the ACR with your AKS
# Creating an ACR
az acr create --resource-group aigroup \
    --name aiclusterRegistry \
    --sku Premium
# Creating token
az acr token create -n MyToken -r aiclusterRegistry \
    --scope-map _repositories_admin
# assumes ACR Admin Account is enabled
ACR_NAME=aiclusterRegistry.azurecr.io
ACR_UNAME=tokenname
ACR_PASSWD=tokenpassword
# Creating the secret
kubectl -n yournamespace create secret \
    docker-registry acr-secret \
    --docker-server=$ACR_NAME \
    --docker-username=$ACR_UNAME \
    --docker-password=$ACR_PASSWD \
    --docker-email=ignorethis@email.com
# Patching default serviceaccount
kubectl -n yournamespace patch serviceaccount default \
    -p '{"imagePullSecrets": [{"name": "acr-secret"}]}'
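For readers who prefer a declarative manifest over the kubectl create secret docker-registry command above, the following Python sketch builds an equivalent Secret of type kubernetes.io/dockerconfigjson. The helper name docker_registry_secret is hypothetical and the credentials are the placeholders from the slide; treat it as an illustration of the secret layout rather than the author's procedure.

```python
import base64
import json


def docker_registry_secret(name, namespace, server, username, password):
    """Build an image pull Secret manifest for a private container registry."""
    # The "auth" field is base64("username:password"), as in a Docker config file.
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    docker_config = {
        "auths": {server: {"username": username, "password": password, "auth": auth}}
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "kubernetes.io/dockerconfigjson",
        "metadata": {"name": name, "namespace": namespace},
        # Values under "data" must themselves be base64-encoded.
        "data": {
            ".dockerconfigjson": base64.b64encode(
                json.dumps(docker_config).encode()
            ).decode()
        },
    }


if __name__ == "__main__":
    secret = docker_registry_secret(
        "acr-secret", "yournamespace",
        "aiclusterRegistry.azurecr.io", "tokenname", "tokenpassword",
    )
    print(json.dumps(secret, indent=2))
```

Keeping the manifest in code makes it easy to create the same pull secret in several namespaces, but real credentials should come from a secret store, not from source files.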

Installing Istio
Note: Kubeflow does not support Istio versions >
1.5. The Istio configuration should take the Knative
requirements for Istio into account
Istio can be installed with:
• istioctl tool
• helm
• Istio operator
# creating a namespace
kubectl create namespace istio-system --save-config
# installing istio
istioctl manifest apply --set profile=default \
    --set components.policy.enabled=true \
    --set addonComponents.kiali.enabled=true \
    --set addonComponents.grafana.enabled=true \
    --set addonComponents.tracing.enabled=true \
    --set values.global.defaultNodeSelector."agentpool"=npdevcpu \
    --set values.global.useMCP=false \
    --set values.global.proxy.autoInject=disabled

Installing Knative
Note: the Knative requirements for Istio are
outdated due to changes in the Istio config
parameters
kubectl apply \
    --filename https://github.com/knative/serving/releases/download/v0.18.0/serving-crds.yaml
kubectl apply \
    --filename https://github.com/knative/serving/releases/download/v0.18.0/serving-core.yaml
kubectl apply \
    --filename https://github.com/knative/net-istio/releases/download/v0.18.0/release.yaml
# Optional, please refer to the installation guide
kubectl apply \
    --filename https://github.com/knative/serving/releases/download/v0.18.0/serving-default-domain.yaml

Installing Kubeflow
Note: because some embedded components are
installed separately, they should be removed from
the Kubeflow manifest
kfctl_k8s_istio.v1.1.0.yaml:
• istio-stack
• knative
• kfserving
Important! The folder {clustername} created by kfctl
should be kept for uninstalling and
reconfiguration
...
applications:
...
- kustomizeConfig:
    repoRef:
      name: manifests
      path: application/v3
  name: application
- kustomizeConfig:
    repoRef:
      name: manifests
      path: stacks/kubernetes/application/istio-1-3-1-stack
  name: istio-stack
- kustomizeConfig:
    repoRef:
      name: manifests
      path: stacks/kubernetes/application/cluster-local-gateway-1-3-1
  name: cluster-local-gateway
...
# Installing kubeflow
kfctl apply -V -f kfctl_k8s_istio_fixed.v1.1.0.yaml
# Deleting kubeflow
kfctl delete -V -f kfctl_k8s_istio_fixed.v1.1.0.yaml
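The manifest clean-up described in the note above can be sketched in Python as a filter over the applications list. The function drop_applications is a hypothetical helper, and the three names are taken literally from the note; the entries in a real manifest may be named differently and should be checked before filtering.

```python
# Names listed in the note above; verify them against the actual manifest.
SEPARATELY_INSTALLED = frozenset({"istio-stack", "knative", "kfserving"})


def drop_applications(applications, names=SEPARATELY_INSTALLED):
    """Return a new list without the entries whose "name" is in `names`."""
    return [app for app in applications if app.get("name") not in names]


if __name__ == "__main__":
    # Entries mirror the manifest excerpt shown on the slide.
    applications = [
        {"kustomizeConfig": {"repoRef": {
            "name": "manifests", "path": "application/v3"}},
         "name": "application"},
        {"kustomizeConfig": {"repoRef": {
            "name": "manifests",
            "path": "stacks/kubernetes/application/istio-1-3-1-stack"}},
         "name": "istio-stack"},
        {"kustomizeConfig": {"repoRef": {
            "name": "manifests",
            "path": "stacks/kubernetes/application/cluster-local-gateway-1-3-1"}},
         "name": "cluster-local-gateway"},
    ]
    for app in drop_applications(applications):
        print(app["name"])
```

In practice the list would come from the manifest loaded with a YAML library, and the filtered result would be written to a separate file such as the kfctl_k8s_istio_fixed.v1.1.0.yaml used in the commands above.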

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Go to http://localhost:8080

Kfserving example - kfserving-tenzorflow-2.yaml
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "mnist"
spec:
  default:
    predictor:
      tensorflow:
        runtimeVersion: 2.3.0
        storageUri: "https://modelstorage.blob.core.windows.net/mnist/"
> kubectl -n mnist apply -f kfserving-tenzorflow-2.yaml

Kfserving example - kfserving-tenzorflow-2.yaml
> kubectl label namespace mnist knative-eventing-injection=enabled
> kubectl label namespace mnist istio-injection=enabled
> kubectl label namespace mnist serving.kubeflow.org/inferenceservice=enabled
> kubectl label namespace mnist katib-metricscollector-injection=enabled
> kubectl -n mnist apply -f kfserving-tenzorflow-2.yaml

> curl -v http://mnist-predictor-default.mnist.1.1.1.1.xip.io/v1/models/mnist/metadata
{
  "model_spec": {
    "name": "mnist",
    "signature_name": "",
    "version": "1"
  },
  "metadata": {"signature_def": {
    "signature_def": {
      "serving_default": {
        "inputs": {
          "inputs": {
            "dtype": "DT_STRING",
            "tensor_shape": {
              "dim": [],
              "unknown_rank": true
            },
            "name": "tf_example:0"
          }
        },
        ...
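As a small illustration of how the metadata response above can be inspected programmatically, the Python sketch below walks the nested signature_def structure and lists each input with its dtype and tensor name. The function list_inputs is a hypothetical helper, and SAMPLE contains only the part of the response shown on the slide, with the elided remainder left out.

```python
import json

# Only the fields visible on the slide; the rest of the response is omitted.
SAMPLE = """
{
  "model_spec": {"name": "mnist", "signature_name": "", "version": "1"},
  "metadata": {"signature_def": {"signature_def": {"serving_default": {
    "inputs": {"inputs": {
      "dtype": "DT_STRING",
      "tensor_shape": {"dim": [], "unknown_rank": true},
      "name": "tf_example:0"
    }}
  }}}}
}
"""


def list_inputs(metadata: dict):
    """Return (signature, input name, dtype, tensor name) for every model input."""
    signatures = metadata["metadata"]["signature_def"]["signature_def"]
    rows = []
    for signature_name, signature in signatures.items():
        for input_name, spec in signature.get("inputs", {}).items():
            rows.append((signature_name, input_name, spec["dtype"], spec["name"]))
    return rows


if __name__ == "__main__":
    for row in list_inputs(json.loads(SAMPLE)):
        print(row)
```

A check like this is useful before sending prediction requests, since the dtype and tensor name determine how the request payload has to be encoded.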
