Kostiantyn Bokhan is a technical lead at N-IX who focuses on data science projects. He leads projects in several areas, including computer vision, NLP, and signal processing, and consults clients on digital transformation with AI. In his free time, he conducts research in deep learning. Kostiantyn has been an associate professor and faculty member at several universities since 2002. His research focuses on machine learning, deep learning, and signal and image processing. He received a PhD in network and telecommunications systems, with research in digital signal processing, in 2013. He has served on the scientific committees and review boards of several conferences.
Speech Overview:
Applying machine learning to make business applications and services intelligent is more than just training models and serving them. It requires implementing end-to-end, continuously repeatable cycles of training, testing, deploying, monitoring, and operating the models. Continuous delivery for machine learning (CD4ML) is a technique that enables reliable end-to-end cycles of developing, deploying, and monitoring machine learning models. Many tools and frameworks can be used to implement CD4ML. One of them is Kubeflow. This talk describes our experience of using Kubeflow to implement CD4ML in the manufacturing area on Azure Kubernetes Service (AKS).
2. Agenda
1. Introduction to CD4ML
2. Kubeflow
3. Use cases of Kubeflow
4. Installing Kubeflow on Azure - tips and tricks
3. Introduction to CD4ML
Continuous Delivery for Machine Learning (CD4ML) is a software engineering
approach in which a cross-functional team produces machine learning
applications based on code, data, and models in small and safe increments
that can be reproduced and reliably released at any time, in short adaptation
cycles.
Danilo Sato, Arif Wider, Christoph Windheuser. Continuous Delivery for Machine Learning: https://martinfowler.com/articles/cd4ml.html
4. Introduction to CD4ML
MLOps: Continuous delivery and automation pipelines in machine learning
MLOps is an ML engineering culture and practice that aims at unifying ML
system development (Dev) and ML system operation (Ops).
Google: https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
8. Introduction to CD4ML
Figure: the CD4ML end-to-end process, based on https://martinfowler.com/articles/cd4ml.html
Stages: Data preparation, Model Building, Model Evaluation and Experimentation, Productionize Model, Testing, Deployment, Monitoring and Observability
Code: labeling code, training code, evaluating code, test code, application code
Models: candidate models, chosen models, productionized models
Data: raw data, training data, validation / test data, test data, production data, metrics
In production: code and model
11. Introduction to CD4ML
Pachyderm: an end-to-end model versioning framework that helps create reproducible pipeline definitions, with each processing step packaged in a Docker container.
Amazon SageMaker: a fully managed machine learning service. Developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment.
MLflow: an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components.
Kubeflow: a project dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.
AzureML: empowers developers with a wide range of productive experiences for building, training, and deploying machine learning models faster. It aims to accelerate time to market and foster team collaboration with MLOps (DevOps for machine learning) on a secure, trusted platform designed for responsible ML.
Other: Lightbend, Streamlio’s Community Edition, Polyaxon, MFlow, Dataiku, Domino Data Science Platform, ParallelM MCenter, Seldon, MLeap
12. Kubeflow
Kubeflow Pipelines: a platform for building and deploying portable and scalable end-to-end ML workflows, based on containers.
Katib: automated tuning of a machine learning (ML) model’s hyperparameters and architecture, and AutoML in general.
The Jupyter Notebook: an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Fairing: Kubeflow Fairing streamlines the process of building, training, and deploying machine learning (ML) training jobs in a hybrid cloud environment.
Kale: a Python package that aims to automatically deploy a general-purpose Jupyter Notebook as a running Kubeflow Pipelines instance, without requiring the use of the KFP DSL.
Metadata: the goal of the Metadata project is to track and manage metadata of machine learning workflows in Kubeflow.
13. Kubeflow
TFJob: a Kubernetes custom resource that you can use to run TensorFlow training jobs on Kubernetes, including distributed jobs.
Seldon: an open source platform for deploying machine learning models on a Kubernetes cluster.
PyTorch jobs: you can create and manage PyTorch jobs like other built-in resources in Kubernetes.
The NVIDIA TensorRT Inference Server: provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
TensorFlow Serving: a flexible, high-performance serving system for machine learning models, designed for production environments.
15. Use cases - the project background
AI for a Worldwide Logistic Platform
16. Use cases - the project background
Figure: AI capabilities of the platform
Tasks: Object detection, OCR, Language modeling, NLP, Anomaly detection, Document matching, Template matching, Pattern recognition, Segmentation, Classification
Consumers: Mobile Apps, IoT Apps, SaaS
17. Use cases - the project background
Goals of implementing CD4ML
● Integrated infrastructure for AI experiments based on the Jupyter Notebook service
● Automation of all stages of deep learning development at scale:
○ Preprocessing, dataset preparation, augmentation
○ Model training and verification
○ Leveraging AutoML:
■ Neural architecture search based on AutoKeras
■ Training several models simultaneously
○ Optimization of model hyperparameters, which is the most frequent task
● Tracking and analysis of the results obtained, model versioning, and metadata tracking
● Model as a service and continuous model delivery
27. Deploying Kubeflow on Azure - issues
Istio
● Istio is outdated
● istioctl is not supported
KFServing
● KFServing is outdated
● TensorFlow 2 is not supported
● It is impossible to override the version of TensorFlow
Knative
● Knative is outdated
● The embedded Knative does not support recent versions of Istio
Uninstalling Kubeflow
● The Istio deployment can be deleted
● Kubeflow can’t be uninstalled properly
28. Installing Kubeflow on Azure - tips and tricks
Deployment stages
1. Creating AKS 1.16
2. Creating & linking ACR
3. Installing Istio 1.5
4. Deploying Knative 0.18
5. Installing Kubeflow 1.1.0
6. Deploying kfserving 0.4.0
7. Deploying other components
29. Installing Kubeflow on Azure - tips and tricks
Creating AKS
• Kubeflow is not fully tested with Kubernetes versions > 1.16
• nodepool-name examples:
  npdevcpu: only for CPU tasks
    nodeSelector."agentpool"=npdevcpu
  npdevstorage: only for storage services, e.g. Rook, MinIO, etc.
    nodeSelector."agentpool"=npdevstorage
az aks create --resource-group aigroup \
  --name aicluster \
  --node-count 3 \
  --vm-set-type VirtualMachineScaleSets \
  --nodepool-name npdevcpu \
  --load-balancer-sku standard \
  --kubernetes-version 1.16.15 \
  --node-vm-size Standard_DS3_v2 \
  --generate-ssh-keys \
  --service-principal "XXXXX" \
  --client-secret "XXXXX"
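The note that Kubeflow is not fully tested with Kubernetes versions > 1.16 could be enforced with a small guard before the cluster is created. The following is only a sketch: the helper name `k8s_version_supported` and the variable `MAX_TESTED_MINOR` are hypothetical, only the major.minor part of the version is compared, and `sort -V` (GNU coreutils) is assumed to be available.

```shell
# Hypothetical helper (not part of az or kubectl): succeeds when the given
# Kubernetes version is not newer than the last minor release tested with Kubeflow.
MAX_TESTED_MINOR="1.16"   # value taken from the slide; an assumption for other Kubeflow releases

k8s_version_supported() {
  local version="$1"
  local minor lowest
  # Keep only major.minor, e.g. 1.16.15 -> 1.16
  minor="$(printf '%s' "$version" | cut -d. -f1,2)"
  # sort -V orders version strings; the older of the two comes first
  lowest="$(printf '%s\n%s\n' "$minor" "$MAX_TESTED_MINOR" | sort -V | head -n1)"
  [ "$lowest" = "$minor" ]
}

# Usage: stop before creating the cluster with an untested version
# k8s_version_supported "1.16.15" || { echo "untested Kubernetes version" >&2; exit 1; }
```

Comparing with `sort -V` rather than plain string comparison keeps versions such as 1.9 ordered before 1.16.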
30. Installing Kubeflow on Azure - tips and tricks
Adding a GPU node pool and installing NVIDIA drivers
npdevgpu: only for GPU tasks
  nodeSelector."agentpool"=npdevgpu
nvidia-device-plugin-ds.yaml can be found in the Azure AKS documentation
az aks nodepool add \
  --cluster-name aicluster \
  --name npdevgpu \
  --resource-group aigroup \
  --node-count 3 \
  --node-vm-size Standard_NC6
kubectl create namespace gpu-resources
kubectl apply -f nvidia-device-plugin-ds.yaml
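Once the device plugin is running, a workload reaches the GPU pool by combining the `agentpool` node selector with a request for the `nvidia.com/gpu` resource. The sketch below only prints such a Pod manifest. The function name `render_gpu_pod`, the pod name, and the default image are hypothetical placeholders, not something taken from the talk.

```shell
# Hypothetical helper: prints a Pod manifest that requests one GPU and is
# pinned to the GPU node pool through the "agentpool" node label.
render_gpu_pod() {
  local image="${1:-example.azurecr.io/gpu-smoke-test:latest}"  # placeholder image, replace with a real one
  local pool="${2:-npdevgpu}"                                   # node pool name used on the slide
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    agentpool: ${pool}
  containers:
  - name: main
    image: ${image}
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Usage (needs cluster access):
# render_gpu_pod myregistry.azurecr.io/train:1 | kubectl apply -f -
```

Printing the manifest instead of applying it directly makes it easy to review or to store it next to the other deployment files.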
31. Installing Kubeflow on Azure - tips and tricks
Creating an ACR and linking with the AKS
Note: if you are not a subscription owner, you can’t link the ACR with your AKS. In that case, use a registry token and an image pull secret.
# Creating an ACR
az acr create --resource-group aigroup \
  --name aiclusterRegistry \
  --sku Premium
# Creating a token
az acr token create -n MyToken -r aiclusterRegistry \
  --scope-map _repositories_admin
# assumes ACR Admin Account is enabled
ACR_NAME=aiclusterRegistry.azurecr.io
ACR_UNAME=tokenname
ACR_PASSWD=tokenpassword
# Creating the secret
kubectl -n yournamespace create secret docker-registry acr-secret \
  --docker-server=$ACR_NAME \
  --docker-username=$ACR_UNAME \
  --docker-password=$ACR_PASSWD \
  --docker-email=ignorethis@email.com
# Patching the default serviceaccount
kubectl -n yournamespace patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "acr-secret"}]}'
32. Installing Kubeflow on Azure - tips and tricks
Installing Istio
Note: Kubeflow does not support Istio versions > 1.5. The Istio configuration should take the Knative requirements for Istio into account.
Istio can be installed with:
• istioctl tool
• helm
• Istio operator
# creating a namespace
kubectl create namespace istio-system --save-config
# installing istio
istioctl manifest apply --set profile=default \
  --set components.policy.enabled=true \
  --set addonComponents.kiali.enabled=true \
  --set addonComponents.grafana.enabled=true \
  --set addonComponents.tracing.enabled=true \
  --set values.global.defaultNodeSelector."agentpool"=npdevcpu \
  --set values.global.useMCP=false \
  --set values.global.proxy.autoInject=disabled
33. Installing Kubeflow on Azure - tips and tricks
Installing Knative
Note: the Knative requirements for Istio are outdated because the Istio configuration parameters have changed
kubectl apply \
  --filename https://github.com/knative/serving/releases/download/v0.18.0/serving-crds.yaml
kubectl apply \
  --filename https://github.com/knative/serving/releases/download/v0.18.0/serving-core.yaml
kubectl apply \
  --filename https://github.com/knative/net-istio/releases/download/v0.18.0/release.yaml
# Optional, please refer to the installation guide
kubectl apply \
  --filename https://github.com/knative/serving/releases/download/v0.18.0/serving-default-domain.yaml
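Because the same release number appears in every Knative URL, the list of manifests can be generated from a single version variable. This is a sketch only: the function name `knative_manifest_urls` is hypothetical, the URL pattern is copied from the commands above, and whether it holds for releases other than v0.18.0 is an assumption.

```shell
# Hypothetical helper: prints the required Knative manifest URLs for one
# release, in the order in which they are applied above.
knative_manifest_urls() {
  local version="${1:-0.18.0}"            # release used on the slide
  local base="https://github.com/knative"
  printf '%s\n' \
    "${base}/serving/releases/download/v${version}/serving-crds.yaml" \
    "${base}/serving/releases/download/v${version}/serving-core.yaml" \
    "${base}/net-istio/releases/download/v${version}/release.yaml"
}

# Usage (needs cluster access):
# knative_manifest_urls 0.18.0 | while read -r url; do kubectl apply --filename "$url"; done
```

The optional serving-default-domain.yaml manifest is left out on purpose, since it is not required for every installation.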
34. Installing Kubeflow on Azure - tips and tricks
Installing Kubeflow
Note: Because some embedded components are installed separately, they should be removed from the Kubeflow manifest kfctl_k8s_istio.v1.1.0.yaml:
• istio-stack
• knative
• kfserving
Important! The folder {clustername} created by kfctl should be kept for uninstalling and reconfiguration
...
applications:
...
- kustomizeConfig:
    repoRef:
      name: manifests
      path: application/v3
  name: application
- kustomizeConfig:
    repoRef:
      name: manifests
      path: stacks/kubernetes/application/istio-1-3-1-stack
  name: istio-stack
- kustomizeConfig:
    repoRef:
      name: manifests
      path: stacks/kubernetes/application/cluster-local-gateway-1-3-1
  name: cluster-local-gateway
...
# Installing Kubeflow from the edited manifest
kfctl apply -V -f kfctl_k8s_istio_fixed.v1.1.0.yaml
# Deleting Kubeflow
kfctl delete -V -f kfctl_k8s_istio_fixed.v1.1.0.yaml
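Removing the separately installed components from the manifest by hand is error-prone, so the edit could be scripted. The following awk-based filter is a sketch under stated assumptions: the helper name `drop_kfdef_apps` is hypothetical, the manifest is assumed to use block-style YAML in which each application starts with "- " and has its own `name:` key indented two spaces deeper than that dash, and the application names are the three listed on the slide. It does not parse YAML in general.

```shell
# Hypothetical helper: reads a kfctl manifest on stdin and writes it to stdout
# without the list items whose item-level "name:" is in the drop list.
drop_kfdef_apps() {
  local drop="${1:-istio-stack|knative|kfserving}"   # names taken from the slide
  awk -v drop="$drop" '
    function flush() {
      # Emit the buffered list item unless it was marked for removal
      if (buf != "" && !skip) printf "%s", buf
      buf = ""; skip = 0
    }
    {
      ind = index($0, $1) - 1              # indentation of the current line
      if ($1 == "-") {                     # a new list item begins
        flush(); buf = $0 "\n"; item = ind
        next
      }
      if (buf != "" && ind > item) {       # line belongs to the buffered item
        if (ind == item + 2 && $0 ~ ("^ *name: (" drop ")$")) skip = 1
        buf = buf $0 "\n"
        next
      }
      flush(); print                       # everything else passes through
    }
    END { flush() }
  '
}

# Usage:
# drop_kfdef_apps < kfctl_k8s_istio.v1.1.0.yaml > kfctl_k8s_istio_fixed.v1.1.0.yaml
```

Only the item-level `name:` is compared, so the nested `repoRef` name (`manifests`) never triggers a removal. It is still worth diffing the result against the original before running kfctl.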
35. Installing Kubeflow on Azure - tips and tricks
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Then open http://localhost:8080 in a browser