Kubeflow Pipelines and TensorFlow Extended (TFX) together form an end-to-end platform for deploying production ML pipelines. They provide a configuration framework and shared libraries to integrate the common components needed to define, launch, and monitor your machine learning system. In this talk we describe how to run TFX in hybrid cloud environments.
2. Hybrid Cloud, Kubeflow and TFX
Animesh Singh
Chief Architect
Data and AI Platform
CODAIT
Tommy Li
Software Engineer
CODAIT
Pete MacKinnon
Principal Software Engineer
Red Hat AICoE
3. Center for Open Source
Data and AI
Technologies (CODAIT)
Code – Build and improve practical frameworks to
enable more developers to realize immediate
value.
Content – Showcase solutions for complex and
real-world AI problems.
Community – Bring developers and data
scientists together to engage with IBM
Improving Enterprise AI lifecycle in
Open Source
• Team contributes to over 10 open source projects
• Committers in Kubeflow, Spark, Tensorflow, PyTorch, ONNX…
• 17 committers and many contributors in Apache projects
• Speakers at over 100 conferences, meetups, unconferences and more
CODAIT
codait.org
7. …and is much more complex…
[Diagram: the full ML lifecycle spans data ingestion, data cleansing, data analysis, data transformation, data validation, data splitting, and data prep; then building a model, model validation, training at scale, training optimization, and model creation; then deploying, serving, monitoring & logging, fine-tuning & improvements, and rollout, across cloud and edge.]
9. Hybrid Cloud
Definition:
A cloud that is:
• Inclusive of on-prem and public
• Multicloud
• Open
• Secure
• Managed
To scale enterprise workloads across the globe
17. OpenShift is our Private Cloud for ML Workloads
[Diagram: RHEL nodes run containers on physical, virtual, private, public, or hybrid infrastructure; a Red Hat Enterprise Linux master provides API/authentication, the data store, the scheduler, and health/scaling; a service layer adds persistent storage and a registry; the platform plugs into existing automation toolsets, SCM (Git), and CI/CD.]
For the data scientist this means the best of the SDLC applied to ML in production: ML deployed across clouds, data centers, and the edge; ML services load balanced and scaled; ML microservices scheduled and orchestrated on shared resources.
19. The Open Data Hub Project
● OpenDataHub.io
● Meta-operator that integrates best open source AI/ML/Data projects
● Blueprint architecture for AI/ML on OpenShift
https://opendatahub.io/docs/architecture.html
Workflow: Data Acquisition & Preparation → ML Model Selection, Training, Testing → ML Model Deployment in the App Dev Process
23. Red Hat OpenShift on IBM Cloud
[Diagram: the Armada API fronts a carrier; carrier masters manage clusters of worker nodes.]
24. Red Hat OpenShift on IBM Cloud
[Diagram: the same architecture with a Red Hat OpenShift carrier added alongside, its masters managing clusters of OpenShift workers.]
25. Red Hat OpenShift on IBM Cloud
[Diagram: the Armada API, carrier workers, and cluster workers each run on Ubuntu Linux with containerd, the community Kubelet, and a Calico agent.]
26. Red Hat OpenShift on IBM Cloud
[Diagram: the previous stack, plus the Red Hat carrier and Red Hat workers, which run on RHEL with CRI-O, the OpenShift Kubelet, and a Calico agent.]
30. Distributed Model Training and HPO (Katib, TFJob, PyTorchJob…)
● Addresses one of the key goals for the model builder persona: distributed model training and hyperparameter optimization for TensorFlow, PyTorch, etc.
● Common problems in HP optimization:
○ Overfitting
○ Wrong metrics
○ Too few hyperparameters
● Katib: a fully open source, Kubernetes-native hyperparameter tuning service
○ Inspired by Google Vizier
○ Framework agnostic
○ Extensible algorithms
31. Serving and Management: KFServing
Bringing the power of Knative and Istio to serverless model deployments
[Diagram: KFServing layers on Knative and Istio atop a Kubernetes compute cluster (GPU, TPU, CPU), pulling model assets from cloud object storage; an inference service chains pre-process, predict, post-process, and explain steps.]
33. Kubeflow Pipelines
- Released to Kubeflow in Nov 2018; integrated into the KF deployment CLI and 1-click-deploy app
- Aimed to bring:
  - Orchestration for complex ML workflows
  - Reproducible and reliable experimentation
  - A bridge between experimentation and operationalization
  - Composition of reusable ML components and pipelines
34. Experiment Tracking
• Kubeflow offers an easy way to compare different runs of a pipeline.
• You can create a pipeline with model training, run it multiple times with different parameter values, and get the accuracy and ROC AUC scores of every run compared.
• There is a lot more under the “Compare runs” view.
35. What constitutes a Kubeflow ML Pipeline
§ Containerized implementations of ML tasks
§ Pre-built components: just provide params or code snippets (e.g. training code)
§ Create your own components from code or libraries
§ Use any runtime, framework, data types
§ Attach k8s objects - volumes, secrets
§ Specification of the sequence of steps
§ Specified via Python DSL
§ Inferred from data dependencies on input/output
§ Input parameters
§ A “Run” = pipeline invoked w/ specific parameters
§ Can be cloned with different parameters
§ Schedules
§ Invoke a single run or create a recurring scheduled pipeline
37. Creating your own components
- Ways to build reusable components for pipelines:
- Create a container with your code and write either a ContainerOp() or a shareable component descriptor
- Turn your Python code into a component directly in the notebook (with or without building a container)
- These components can be exported into a shareable format
[Diagram: execution code plus a container image yields a ContainerOp with an I/O schema; a container build plus an I/O schema yields a shareable component descriptor.]
42. What is TFX?
TL;DR
● TFX is a platform for deploying TensorFlow models in production
● TFX pipelines consist of a set of integrated components
● TFX pipelines are configured using Python
● TFX consists of components, executors, and libraries
● TFX components are optional (and repeatable)
● TFX can be configured to run in many different ways
43. TFX has existed externally as open source libraries
Open sourced TFX libraries (circa 2018): TensorFlow Data Validation, TensorFlow Transform, TensorFlow Model Analysis, TensorFlow Serving
44. In 2019, the horizontal layers that integrate the TFX libraries into one platform were open sourced
Open sourced TFX platform (2019):
[Diagram: a pipeline from data ingestion through TensorFlow Data Validation, TensorFlow Transform, an Estimator or Keras model, TensorFlow Model Analysis, TensorFlow Serving, and logging; underneath sit shared utilities for garbage collection and data access controls, pipeline storage, a shared configuration framework and job orchestration, and an integrated frontend for job management, monitoring, debugging, and data/model/evaluation visualization.]
45. Anatomy of a Component
TFX components consist of three main pieces:
● Driver
● Executor
● Publisher
46. Anatomy of a Component
TFX includes both libraries and pipeline components. This diagram illustrates the relationships between the TFX libraries and pipeline components: TFX provides several Python packages, which are the libraries used to create pipeline components.
47. TFX (inside the box)
[Diagram: a TFX pipeline of components (ExampleGen, StatisticsGen, SchemaGen, ExampleValidator, Transform, Trainer, Evaluator, ModelValidator, Pusher) driven by the TFX config, reading training and eval data, recording to the metadata store, and pushing models to TensorFlow Serving, TensorFlow Hub, TensorFlow Lite, TensorFlow JS, or other runtimes.]
48. TFX uses ml-metadata for artifact management
[Diagram: task-aware pipelines simply chain Transform and Trainer over input data; task- and data-aware pipelines add pipeline + metadata storage, so training data, transformed data, trained models, and the serving system are all tracked as artifacts.]
49. Snapshot of a component
[Diagram: the Trainer records its config and the new model in the metadata store; the ModelValidator reads the last validated model and the new (candidate) model, records a validation outcome; the Pusher uses the candidate model plus that outcome to decide whether to push.]
50. What’s in the Metadata Store?
● Type definitions of artifacts and their properties, e.g. models, data, evaluation metrics
● Execution records (runs) of components, e.g. runtime configuration, inputs + outputs
● Lineage tracking across all executions, e.g. to recurse back to all inputs of a specific artifact
51. Examples of Metadata-Powered Functionality
● Find out which data a model was trained on
● Compare previous model runs
● Carry over state from previous models
● Re-use previously computed outputs
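ml-metadata provides this lineage tracking out of the box; purely to illustrate the idea behind "which data was this model trained on", here is a stand-alone sketch using sqlite3. The schema and names are invented for the example and are much simpler than ml-metadata's real schema:

```python
import sqlite3

# Toy lineage store: artifacts, plus events linking executions to their
# inputs and outputs.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE artifact (id INTEGER PRIMARY KEY, type TEXT, uri TEXT);
CREATE TABLE event (execution_id INTEGER, artifact_id INTEGER, role TEXT);
""")

# One Trainer run (execution 100): consumed a dataset, produced a model.
con.execute("INSERT INTO artifact VALUES (1, 'Data', 's3://bucket/train.csv')")
con.execute("INSERT INTO artifact VALUES (2, 'Model', 's3://bucket/model/1')")
con.execute("INSERT INTO event VALUES (100, 1, 'input')")
con.execute("INSERT INTO event VALUES (100, 2, 'output')")

def training_data_of(model_uri):
    """Walk one hop back: which inputs fed the execution that produced the model?"""
    rows = con.execute("""
        SELECT a_in.uri FROM artifact a_out
        JOIN event e_out ON e_out.artifact_id = a_out.id AND e_out.role = 'output'
        JOIN event e_in  ON e_in.execution_id = e_out.execution_id AND e_in.role = 'input'
        JOIN artifact a_in ON a_in.id = e_in.artifact_id
        WHERE a_out.uri = ?""", (model_uri,)).fetchall()
    return [r[0] for r in rows]

print(training_data_of("s3://bucket/model/1"))  # ['s3://bucket/train.csv']
```

Applying the same walk recursively gives full lineage; comparing runs and re-using outputs are queries over the same execution records.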
53. Hybrid Clouds, TFX and Kubeflow Pipelines
demo
https://github.com/kubeflow/kfp-tekton/tree/master/samples/kfp-tfx
58. Recommendations for TFX and KFP
• TFX pipelines should be executable without any dependency on a public cloud service, e.g. GCS.
• Apache Beam is a strong dependency in TFX, and it doesn’t support S3 natively.
• The TFX DSL should support dynamically creating Persistent Volume Claims.
• Support mixing and matching KFP ContainerOp components with TFX ones through the DSL.
• IBM IKS runs Kubernetes with containerd; OpenShift uses CRI-O. The underlying pipeline platforms (Argo, Airflow, Beam, etc.) should support them as first-class citizens.
• Visualizing artifacts in the Kubeflow Pipelines UI should not be limited to GCS by default.
• Don’t assume root privileges on OpenShift and Kube, or on the underlying storage file system.