
Hybrid Cloud, Kubeflow and TensorFlow Extended [TFX]

Kubeflow Pipelines and TensorFlow Extended (TFX) together form an end-to-end platform for deploying production ML pipelines. It provides a configuration framework and shared libraries to integrate the common components needed to define, launch, and monitor your machine learning system. In this talk we describe how to run TFX in hybrid cloud environments.

  1. Hybrid Cloud, Kubeflow and TensorFlow Extended [TFX]. Session by:
  2. Hybrid Cloud, Kubeflow and TFX. Animesh Singh, Chief Architect, Data and AI Platform, CODAIT; Tommy Li, Software Engineer, CODAIT; Pete MacKinnon, Principal Software Engineer, Red Hat AICoE.
  3. Center for Open Source Data and AI Technologies (CODAIT): improving the enterprise AI lifecycle in open source. Code: build and improve practical frameworks to enable more developers to realize immediate value. Content: showcase solutions for complex and real-world AI problems. Community: bring developers and data scientists to engage with IBM. The team contributes to over 10 open source projects, with committers in Kubeflow, Spark, TensorFlow, PyTorch, ONNX and more, 17 committers and many contributors in Apache projects, and speakers at over 100 conferences, meetups, unconferences and more. codait.org
  4. The Machine Learning Workflow
  5. Perception
  6. And the ML workflow spans teams…
  7. …and is much more complex: data ingestion and prep (data cleansing, analysis, transformation, validation, splitting), model creation (building a model, model validation, training at scale, training optimization), and rollout (deploying, serving, monitoring and logging, fine-tuning and improvements), spanning cloud and edge.
  8. Hybrid Clouds, TFX and Kubeflow Pipelines: Hybrid Cloud
  9. Hybrid cloud definition: a cloud that is inclusive of on-prem and public, multicloud, open, secure, and managed, to scale enterprise workloads across the globe.
  10. Cloud / on-prem infrastructure: compute, network, storage.
  11. Adding the platform layer: security, Kubernetes, DevOps, CF App Services.
  12. Adding the data layer: databases, analytics, governance.
  13. Adding the AI layer: Watson, TensorFlow, machine learning.
  14. The full stack: infrastructure, platform, data, and AI.
  15. The same stack spanning public and private clouds.
  16. Hybrid Clouds, TFX and Kubeflow Pipelines: Private Cloud
  17. OpenShift is our private cloud for ML workloads: ML microservices scheduled and orchestrated on shared resources; ML services load balanced and scaled; ML deployed across clouds, data centers, and edge; the best of the SDLC applied to ML in production. [Architecture diagram: RHEL nodes running containers under a master (API/authentication, data store, scheduler, health/scaling), with a registry, persistent storage, SCM (Git) and a CI/CD service layer, deployable on physical, virtual, private, public, or hybrid infrastructure.]
  18. We are working across the data and AI lifecycle. [Diagram: Open Data Hub, an AI platform powered by open source: data ingest, data processing (analytics, RDBMS, discovery), ML training, ML inference, CI/CD pipelines, and enterprise apps running on Red Hat OpenShift Container Platform with OpenShift Container Storage, Ceph, and a multicloud gateway, over bare metal or Red Hat OpenStack Platform; personas include data engineer, data scientist, infra engineer, and app developer.]
  19. The Open Data Hub project (OpenDataHub.io): a meta-operator that integrates the best open source AI/ML/data projects, and a blueprint architecture for AI/ML on OpenShift, covering data acquisition and preparation; ML model selection, training, and testing; and ML model deployment in the app development process. https://opendatahub.io/docs/architecture.html
  20. Open Data Hub Blueprint
  21. Hybrid Clouds, TFX and Kubeflow Pipelines: Public Cloud
  22. IBM Cloud architecture. Infrastructure: x86 and Power CPUs, GPUs, compute sleds, programmable mesh network, flash and spinning storage. Container orchestration and networking: Kubernetes.
  23. Red Hat OpenShift on IBM Cloud: the Armada API fronts a carrier, whose masters manage customer clusters of workers.
  24. The same Armada control plane provisions Red Hat OpenShift clusters alongside community Kubernetes clusters, each with their own masters and workers.
  25. Community Kubernetes node stack: Ubuntu Linux with containerd, the community Kubelet, and the Calico agent, on the Armada API, carrier workers, and cluster workers.
  26. OpenShift node stack added alongside: RHEL with CRI-O, the OpenShift Kubelet, and the Calico agent, on the Red Hat carrier and Red Hat workers.
  27. Both node stacks run side by side under the same Armada API.
  28. Hybrid Clouds, TFX and Kubeflow Pipelines: Kubeflow Pipelines
  30. Distributed model training and HPO (Katib, TFJob, PyTorchJob, and more). Addresses one of the key goals for the model-builder persona: distributed model training and hyperparameter optimization for TensorFlow, PyTorch, etc. Common problems in HP optimization: overfitting, wrong metrics, too few hyperparameters. Katib is a fully open source, Kubernetes-native hyperparameter tuning service: inspired by Google Vizier, framework agnostic, with extensible algorithms.
  31. Serving and management: KFServing, bringing the power of Knative and Istio to serverless model deployments. Requests flow through pre-process, predict, post-process, and explain stages on a Kubernetes compute cluster (GPU, TPU, CPU), with model assets in cloud object storage.
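The pre-process / predict / post-process flow above can be sketched in plain Python. This is only an illustration of the stage chaining, not the KFServing API: the payload shape, the toy linear "model", and all function names here are invented for the example.

```python
# Toy request flow of a KFServing-style model server:
# pre-process -> predict -> post-process.

def preprocess(payload: dict) -> list:
    # Normalize the raw JSON payload into a feature vector.
    return [float(x) for x in payload["instances"][0]]

def predict(features: list) -> float:
    # Stand-in for a real model: a fixed linear scorer.
    weights = [0.5, -0.25, 1.0]
    return sum(w * x for w, x in zip(weights, features))

def postprocess(score: float) -> dict:
    # Turn the raw score into the response the client sees.
    return {"predictions": [{"score": score, "tip_over_20pct": score > 0.5}]}

def serve(payload: dict) -> dict:
    return postprocess(predict(preprocess(payload)))

response = serve({"instances": [[2.0, 4.0, 0.5]]})
```

In the real server each stage can live in its own container, which is what lets KFServing scale and version them independently.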
  32. KFServing: default and canary configurations. KFServing manages the hosting aspects of your models. A KFService manages the lifecycle of models; a Configuration manages the history of model deployments, with two configurations for default and canary; a Revision is a snapshot of your model version (config and image); a Route manages the endpoint and network traffic, e.g. 90% to the default configuration and 10% to the canary.
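The 90/10 split the Route performs can be pictured with a toy router. In reality KFServing delegates traffic splitting to Istio/Knative; this deterministic modulo scheme is purely illustrative.

```python
# Toy 90/10 traffic split between the default and canary configurations.

def route(request_id: int, canary_percent: int = 10) -> str:
    """Send canary_percent of every 100 requests to the canary revision."""
    return "canary" if request_id % 100 < canary_percent else "default"

counts = {"default": 0, "canary": 0}
for i in range(1000):
    counts[route(i)] += 1
# counts == {"default": 900, "canary": 100}
```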
  33. Kubeflow Pipelines: released to Kubeflow in Nov 2018, integrated into the KF deployment CLI and 1-click deploy app. Aimed to bring orchestration for complex ML workflows, reproducible and reliable experimentation, a bridge between experimentation and operationalization, and composition of reusable ML components and pipelines.
  34. Experiment tracking. Kubeflow offers an easy way to compare different runs of a pipeline: create a pipeline with model training, run it multiple times with different parameter values, and the accuracy and ROC AUC scores for every run are compared. There is a lot more under the “Compare runs” view.
  35. What constitutes a Kubeflow ML pipeline? Containerized implementations of ML tasks: pre-built components where you just provide params or code snippets (e.g. training code), or your own components built from code or libraries, using any runtime, framework, or data types, with attached k8s objects (volumes, secrets). A specification of the sequence of steps: specified via a Python DSL and inferred from data dependencies on inputs/outputs. Input parameters: a “run” is a pipeline invoked with specific parameters, and can be cloned with different parameters. Schedules: invoke a single run or create a recurring scheduled pipeline.
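The point that step order is "inferred from data dependencies" can be sketched with a topological sort. The dependency graph below mirrors the taxi example later in the deck, but the edges are my reading of the sample, not an official specification.

```python
# Inferring pipeline step order purely from "consumes the output of" edges,
# as the KFP DSL does when it compiles a pipeline function.
from graphlib import TopologicalSorter  # Python 3.9+

# step -> set of steps whose outputs it consumes
deps = {
    "tfdv": set(),
    "preprocess": {"tfdv"},
    "training": {"preprocess", "tfdv"},
    "tfma": {"training", "tfdv"},
    "deploy": {"training"},
}
order = list(TopologicalSorter(deps).static_order())
# Any valid order starts with tfdv and runs training before tfma and deploy.
```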
  36. Define a pipeline with the Python SDK:

    @dsl.pipeline(name='Taxi Cab Classification Pipeline Example')
    def taxi_cab_classification(
            output_dir, project,
            train_data='gs://bucket/train.csv',
            evaluation_data='gs://bucket/eval.csv',
            target='tips',
            learning_rate=0.1,
            hidden_layer_size='100,50',
            steps=3000):
        tfdv = TfdvOp(train_data, evaluation_data, project, output_dir)
        preprocess = PreprocessOp(train_data, evaluation_data,
                                  tfdv.output['schema'], project, output_dir)
        training = DnnTrainerOp(preprocess.output, tfdv.schema, learning_rate,
                                hidden_layer_size, steps, target, output_dir)
        tfma = TfmaOp(training.output, evaluation_data, tfdv.schema,
                      project, output_dir)
        deploy = TfServingDeployerOp(training.output)

  Compile and submit a pipeline run:

    dsl.compile(taxi_cab_classification, 'tfx.tar.gz')
    run = client.run_pipeline(
        'tfx_run', 'tfx.tar.gz',
        params={'output': 'gs://dpa22', 'project': 'my-project-33'})
  37. Creating your own components. Ways to build reusable components for pipelines: create a container with your code and write either a ContainerOp() or a shareable component descriptor; or turn your Python code into a component directly in the notebook (with or without building a container). These components can be exported into a shareable format. [Diagram: execution code is packaged into a container image, wired to a ContainerOp with an I/O schema, and optionally described by a component descriptor.]
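A shareable component descriptor boils down to a name, a container implementation, and an I/O schema. The sketch below models that with plain dicts; the real KFP SDK uses component.yaml files and ContainerOp objects, and the image name and field layout here are hypothetical.

```python
# Minimal model of what a reusable KFP component descriptor captures.

def make_component(name, image, command, inputs, outputs):
    """Build a minimal, shareable component descriptor as a plain dict."""
    return {
        "name": name,
        "implementation": {"container": {"image": image, "command": command}},
        "inputs": inputs,
        "outputs": outputs,
    }

preprocess = make_component(
    name="preprocess",
    image="example.com/taxi/preprocess:latest",  # hypothetical image
    command=["python", "preprocess.py"],
    inputs=[{"name": "raw_data", "type": "CSV"}],
    outputs=[{"name": "transformed_data", "type": "TFRecord"}],
)
```

Because the descriptor carries the image and the I/O schema together, anyone can load it and wire the component into their own pipeline without seeing the code.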
  38. Watson AI Operations: Kubeflow Pipelines.
  39. Example: a Watson Speech to Text operations Kubeflow pipeline.
  40. Example: a Watson Machine Learning and Watson OpenScale pipeline.
  41. Hybrid Clouds, TFX and Kubeflow Pipelines: TFX
  42. What is TFX? TL;DR: TFX is a platform for deploying TensorFlow models in production. TFX pipelines consist of a set of integrated components and are configured using Python. TFX consists of components, executors, and libraries. TFX components are optional (and repeatable), and TFX can be configured to run in many different ways.
  43. TFX has existed externally as open source libraries. Open sourced TFX libraries (circa 2018): TensorFlow Data Validation, TensorFlow Transform, TensorFlow Model Analysis, TensorFlow Serving.
  44. In 2019, the horizontal layers that integrate the TFX libraries into one platform were open sourced. Open sourced TFX platform (2019): data ingestion; TensorFlow Data Validation; TensorFlow Transform; Estimator or Keras model; TensorFlow Model Analysis; TensorFlow Serving; logging; shared utilities for garbage collection and data access controls; pipeline storage; a shared configuration framework and job orchestration; and an integrated frontend for job management, monitoring, debugging, and data/model/evaluation visualization.
  45. Anatomy of a component. TFX components consist of three main pieces: a driver, an executor, and a publisher.
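The division of labor between those three pieces can be mimicked in a few lines. This is a toy, not the tfx package's API: the driver resolves declared inputs from metadata, the executor does the work, and the publisher records outputs back into metadata. The dict-based "store" and all names are invented for the illustration.

```python
# Toy driver / executor / publisher split of a TFX component.

metadata = {"artifacts": {"examples": "gs://bucket/examples"}}  # toy store

def driver(component_inputs):
    # Resolve declared inputs against the metadata store.
    return {name: metadata["artifacts"][name] for name in component_inputs}

def executor(resolved_inputs):
    # Do the actual work (here: pretend to train and emit a model URI).
    return {"model": resolved_inputs["examples"] + "/model"}

def publisher(outputs):
    # Record outputs so downstream components can resolve them.
    metadata["artifacts"].update(outputs)
    return outputs

published = publisher(executor(driver(["examples"])))
```

The split is what makes executors swappable: you can reuse the driver/publisher machinery and replace only the work in the middle.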
  46. Anatomy of a component. TFX includes both libraries and pipeline components; this diagram illustrates the relationships between them. TFX provides several Python packages: the libraries used to create pipeline components.
  47. TFX (inside the box): a TFX pipeline runs ExampleGen, StatisticsGen, SchemaGen, ExampleValidator, Transform, Trainer, Evaluator, ModelValidator, and Pusher, driven by TFX Config and backed by the Metadata Store. It consumes training and eval data and pushes models to TensorFlow Serving, TensorFlow Hub, TensorFlow Lite, TensorFlow JS, or other runtimes.
  48. TFX uses ml-metadata for artifact management, moving from task-aware pipelines (input data, Transform, Trainer, trained models, serving system) to task- and data-aware pipelines backed by pipeline plus metadata storage.
  49. Snapshot of a component: the Trainer reads its config and produces a new (candidate) model; the Model Validator compares it against the last validated model and records a validation outcome; the Pusher picks up the candidate model and the validation outcome from the Metadata Store.
  50. What’s in the Metadata Store? Type definitions of artifacts and their properties (e.g. models, data, evaluation metrics); execution records (runs) of components (e.g. runtime configuration, inputs and outputs); and lineage tracking across all executions (e.g. to recurse back to all inputs of a specific artifact).
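The three kinds of records above are enough to recover lineage. The toy store below is an illustration only; the real ml-metadata library has its own schema and API, and every name here is invented.

```python
# Toy metadata store: artifact records, execution records, and lineage
# recovered by walking execution inputs/outputs backwards.

class MetadataStore:
    def __init__(self):
        self.artifacts = {}    # artifact id -> {"type": ...}
        self.executions = []   # {"component", "inputs", "outputs"}

    def record(self, component, inputs, outputs):
        for art_id, art_type in outputs.items():
            self.artifacts[art_id] = {"type": art_type}
        self.executions.append(
            {"component": component, "inputs": list(inputs),
             "outputs": list(outputs)})

    def lineage(self, artifact_id):
        """Recurse back to all upstream inputs of an artifact."""
        upstream = set()
        for ex in self.executions:
            if artifact_id in ex["outputs"]:
                for inp in ex["inputs"]:
                    upstream.add(inp)
                    upstream |= self.lineage(inp)
        return upstream

store = MetadataStore()
store.record("Transform", inputs=["raw_data"], outputs={"transformed": "Examples"})
store.record("Trainer", inputs=["transformed"], outputs={"model_1": "Model"})
lineage = store.lineage("model_1")
# lineage == {"transformed", "raw_data"}: the data model_1 was trained on.
```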
  51. Examples of metadata-powered functionality: find out which data a model was trained on, compare previous model runs, carry over state from previous models, and re-use previously computed outputs.
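"Re-use previously computed outputs" is essentially caching keyed on a component's name and inputs. The sketch below shows the idea with a content-hash key; the key scheme and function names are assumptions for the example, not how TFX computes its cache keys.

```python
# Metadata-powered output re-use: skip a component whose exact inputs
# were already processed, returning the recorded output instead.
import hashlib
import json

cache = {}
calls = 0

def run_component(name, inputs, fn):
    global calls
    key = hashlib.sha256(
        json.dumps([name, inputs], sort_keys=True).encode()).hexdigest()
    if key not in cache:
        calls += 1          # only count real executions
        cache[key] = fn(inputs)
    return cache[key]

out1 = run_component("StatisticsGen", {"data": "train.csv"},
                     lambda i: "stats-for-" + i["data"])
out2 = run_component("StatisticsGen", {"data": "train.csv"},
                     lambda i: "stats-for-" + i["data"])
# out1 == out2, and the executor ran only once (calls == 1).
```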
  52. Hybrid Clouds, TFX and Kubeflow Pipelines: demo. https://github.com/kubeflow/kfp-tekton/tree/master/samples/kfp-tfx
  53. Center for Open-Source Data & AI Technologies (CODAIT) / June 28, 2019 / © 2019 IBM Corporation. Model goal: will the customer tip more or less than 20%?
  54. TFX taxi pipeline: Kubeflow Pipelines running on the private cloud with an on-prem PVC, triggering model deployment to the public cloud with Istio + Kiali and Kubeflow serving, with object storage shared between them.
  55. Hybrid Clouds, TFX and Kubeflow Pipelines: Lessons Learnt
  56. Recommendations for TFX and KFP: TFX pipelines should be executable without any dependency on a public cloud service, e.g. GCS. Apache Beam is a strong dependency in TFX and doesn’t support S3 natively. The TFX DSL should support dynamically creating Persistent Volume Claims, and should support mixing and matching KFP ContainerOps components with TFX ones. IBM IKS runs Kubernetes with containerd while OpenShift uses CRI-O; the underlying pipeline platforms (Argo, Airflow, Beam, etc.) should support both as first-class citizens. Visualizing artifacts in the Kubeflow Pipelines UI should not be limited to GCS by default. Don’t assume root privileges on OpenShift and Kubernetes, or on the underlying storage file system.
  57. TFX and KFP DSL. Demo code: https://github.com/kubeflow/kfp-tekton/tree/master/samples/kfp-tfx. RFC for the TFX and KFP DSL merge: https://docs.google.com/document/d/1_n3q0mNOr7gUSM04yaA0e5BO9RrS0Vkh1cNCyrB07WM/edit#. Please reach out at @AnimeshSingh for any follow-on discussions.
  58. Thank you! Session by:
