
Hybrid Cloud, Kubeflow and TensorFlow Extended [TFX]

Kubeflow Pipelines and TensorFlow Extended (TFX) together form an end-to-end platform for deploying production ML pipelines. It provides a configuration framework and shared libraries to integrate the common components needed to define, launch, and monitor your machine learning system. In this talk we describe how to run TFX in hybrid cloud environments.

  1. Hybrid Cloud, Kubeflow and TensorFlow Extended [TFX]. Session by:
  2. Hybrid Cloud, Kubeflow and TFX. Animesh Singh, Chief Architect, Data and AI Platform, CODAIT; Tommy Li, Software Engineer, CODAIT; Pete MacKinnon, Principal Software Engineer, Red Hat AICoE.
  3. Center for Open Source Data and AI Technologies (CODAIT): improving the enterprise AI lifecycle in open source. Code: build and improve practical frameworks to enable more developers to realize immediate value. Content: showcase solutions for complex and real-world AI problems. Community: bring developers and data scientists to engage with IBM. The team contributes to over 10 open source projects, with committers in Kubeflow, Spark, TensorFlow, PyTorch, ONNX and more, 17 committers and many contributors in Apache projects, and speakers at over 100 conferences, meetups, unconferences and more. codait.org
  4. The Machine Learning Workflow
  5. Perception
  6. And the ML workflow spans teams…
  7. …and is much more complex: data ingestion and prep (data cleansing, analysis, transformation, validation, splitting), model creation (building a model, model validation, training at scale, training optimization), and rollout (deploying, serving, monitoring and logging, fine-tuning and improvements), spanning cloud and edge.
  8. Hybrid Clouds, TFX and Kubeflow Pipelines: Hybrid Cloud
  9. Hybrid cloud definition: a cloud that is inclusive of on-prem and public, multicloud, open, secure, and managed, to scale enterprise workloads across the globe.
  10. Cloud / on-prem infrastructure: compute, network, storage.
  11. Adding the platform layer: security, Kubernetes, DevOps, CF App Services.
  12. Adding the data layer: databases, analytics, governance.
  13. Adding the AI layer: Watson, TensorFlow, machine learning.
  14. The full stack: infrastructure, platform, data, and AI.
  15. The same stack spanning public and private clouds.
  16. Hybrid Clouds, TFX and Kubeflow Pipelines: Private Cloud
  17. OpenShift is our private cloud for ML workloads: ML microservices scheduled and orchestrated on shared resources; ML services load balanced and scaled; ML deployed across clouds, data centers, and edge; the best of the SDLC applied to ML in production. [Architecture diagram: RHEL nodes running containers under a master (API/authentication, data store, scheduler, health/scaling), with a registry, persistent storage, SCM (Git) and a CI/CD service layer, deployable on physical, virtual, private, public, or hybrid infrastructure.]
  18. We are working across the data and AI lifecycle. [Diagram: Open Data Hub, an AI platform powered by open source: data ingest, data processing (analytics, RDBMS, discovery), ML training, ML inference, CI/CD pipelines, and enterprise apps running on Red Hat OpenShift Container Platform with OpenShift Container Storage, Ceph, and a multicloud gateway, over bare metal or Red Hat OpenStack Platform; personas include data engineer, data scientist, infra engineer, and app developer.]
  19. The Open Data Hub project (OpenDataHub.io): a meta-operator that integrates the best open source AI/ML/data projects, and a blueprint architecture for AI/ML on OpenShift, covering data acquisition and preparation; ML model selection, training, and testing; and ML model deployment in the app development process. https://opendatahub.io/docs/architecture.html
  20. Open Data Hub Blueprint
  21. Hybrid Clouds, TFX and Kubeflow Pipelines: Public Cloud
  22. IBM Cloud architecture. Infrastructure: x86 and Power CPUs, GPUs, compute sleds, programmable mesh network, flash and spinning storage. Container orchestration and networking: Kubernetes.
  23. Red Hat OpenShift on IBM Cloud: the Armada API fronts a carrier, whose masters manage customer clusters of workers.
  24. The same Armada control plane provisions Red Hat OpenShift clusters alongside community Kubernetes clusters, each with their own masters and workers.
  25. Community Kubernetes node stack: Ubuntu Linux with containerd, the community Kubelet, and the Calico agent, on the Armada API, carrier workers, and cluster workers.
  26. OpenShift node stack added alongside: RHEL with CRI-O, the OpenShift Kubelet, and the Calico agent, on the Red Hat carrier and Red Hat workers.
  27. Both node stacks run side by side under the same Armada API.
  28. Hybrid Clouds, TFX and Kubeflow Pipelines: Kubeflow Pipelines
  30. Distributed model training and HPO (Katib, TFJob, PyTorchJob, and more). Addresses one of the key goals for the model-builder persona: distributed model training and hyperparameter optimization for TensorFlow, PyTorch, etc. Common problems in HP optimization: overfitting, wrong metrics, too few hyperparameters. Katib is a fully open source, Kubernetes-native hyperparameter tuning service: inspired by Google Vizier, framework agnostic, with extensible algorithms.
  31. Serving and management: KFServing, bringing the power of Knative and Istio to serverless model deployments. Requests flow through pre-process, predict, post-process, and explain stages on a Kubernetes compute cluster (GPU, TPU, CPU), with model assets in cloud object storage.
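The pre-process / predict / post-process flow above can be sketched in plain Python. This is only an illustration of the stage chaining, not the KFServing API: the payload shape, the toy linear "model", and all function names here are invented for the example.

```python
# Toy request flow of a KFServing-style model server:
# pre-process -> predict -> post-process.

def preprocess(payload: dict) -> list:
    # Normalize the raw JSON payload into a feature vector.
    return [float(x) for x in payload["instances"][0]]

def predict(features: list) -> float:
    # Stand-in for a real model: a fixed linear scorer.
    weights = [0.5, -0.25, 1.0]
    return sum(w * x for w, x in zip(weights, features))

def postprocess(score: float) -> dict:
    # Turn the raw score into the response the client sees.
    return {"predictions": [{"score": score, "tip_over_20pct": score > 0.5}]}

def serve(payload: dict) -> dict:
    return postprocess(predict(preprocess(payload)))

response = serve({"instances": [[2.0, 4.0, 0.5]]})
```

In the real server each stage can live in its own container, which is what lets KFServing scale and version them independently.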
  32. KFServing: default and canary configurations. KFServing manages the hosting aspects of your models. A KFService manages the lifecycle of models; a Configuration manages the history of model deployments, with two configurations for default and canary; a Revision is a snapshot of your model version (config and image); a Route manages the endpoint and network traffic, e.g. 90% to the default configuration and 10% to the canary.
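The 90/10 split the Route performs can be pictured with a toy router. In reality KFServing delegates traffic splitting to Istio/Knative; this deterministic modulo scheme is purely illustrative.

```python
# Toy 90/10 traffic split between the default and canary configurations.

def route(request_id: int, canary_percent: int = 10) -> str:
    """Send canary_percent of every 100 requests to the canary revision."""
    return "canary" if request_id % 100 < canary_percent else "default"

counts = {"default": 0, "canary": 0}
for i in range(1000):
    counts[route(i)] += 1
# counts == {"default": 900, "canary": 100}
```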
  33. Kubeflow Pipelines: released to Kubeflow in Nov 2018, integrated into the KF deployment CLI and 1-click deploy app. Aimed to bring orchestration for complex ML workflows, reproducible and reliable experimentation, a bridge between experimentation and operationalization, and composition of reusable ML components and pipelines.
  34. Experiment tracking. Kubeflow offers an easy way to compare different runs of a pipeline: create a pipeline with model training, run it multiple times with different parameter values, and the accuracy and ROC AUC scores for every run are compared. There is a lot more under the “Compare runs” view.
  35. What constitutes a Kubeflow ML pipeline? Containerized implementations of ML tasks: pre-built components where you just provide params or code snippets (e.g. training code), or your own components built from code or libraries, using any runtime, framework, or data types, with attached k8s objects (volumes, secrets). A specification of the sequence of steps: specified via a Python DSL and inferred from data dependencies on inputs/outputs. Input parameters: a “run” is a pipeline invoked with specific parameters, and can be cloned with different parameters. Schedules: invoke a single run or create a recurring scheduled pipeline.
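The point that step order is "inferred from data dependencies" can be sketched with a topological sort. The dependency graph below mirrors the taxi example later in the deck, but the edges are my reading of the sample, not an official specification.

```python
# Inferring pipeline step order purely from "consumes the output of" edges,
# as the KFP DSL does when it compiles a pipeline function.
from graphlib import TopologicalSorter  # Python 3.9+

# step -> set of steps whose outputs it consumes
deps = {
    "tfdv": set(),
    "preprocess": {"tfdv"},
    "training": {"preprocess", "tfdv"},
    "tfma": {"training", "tfdv"},
    "deploy": {"training"},
}
order = list(TopologicalSorter(deps).static_order())
# Any valid order starts with tfdv and runs training before tfma and deploy.
```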
  36. Define a pipeline with the Python SDK:

    @dsl.pipeline(name='Taxi Cab Classification Pipeline Example')
    def taxi_cab_classification(
            output_dir, project,
            train_data='gs://bucket/train.csv',
            evaluation_data='gs://bucket/eval.csv',
            target='tips',
            learning_rate=0.1,
            hidden_layer_size='100,50',
            steps=3000):
        tfdv = TfdvOp(train_data, evaluation_data, project, output_dir)
        preprocess = PreprocessOp(train_data, evaluation_data,
                                  tfdv.output['schema'], project, output_dir)
        training = DnnTrainerOp(preprocess.output, tfdv.schema, learning_rate,
                                hidden_layer_size, steps, target, output_dir)
        tfma = TfmaOp(training.output, evaluation_data, tfdv.schema,
                      project, output_dir)
        deploy = TfServingDeployerOp(training.output)

  Compile and submit a pipeline run:

    dsl.compile(taxi_cab_classification, 'tfx.tar.gz')
    run = client.run_pipeline(
        'tfx_run', 'tfx.tar.gz',
        params={'output': 'gs://dpa22', 'project': 'my-project-33'})
  37. Creating your own components. Ways to build reusable components for pipelines: create a container with your code and write either a ContainerOp() or a shareable component descriptor; or turn your Python code into a component directly in the notebook (with or without building a container). These components can be exported into a shareable format. [Diagram: execution code is packaged into a container image, wired to a ContainerOp with an I/O schema, and optionally described by a component descriptor.]
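A shareable component descriptor boils down to a name, a container implementation, and an I/O schema. The sketch below models that with plain dicts; the real KFP SDK uses component.yaml files and ContainerOp objects, and the image name and field layout here are hypothetical.

```python
# Minimal model of what a reusable KFP component descriptor captures.

def make_component(name, image, command, inputs, outputs):
    """Build a minimal, shareable component descriptor as a plain dict."""
    return {
        "name": name,
        "implementation": {"container": {"image": image, "command": command}},
        "inputs": inputs,
        "outputs": outputs,
    }

preprocess = make_component(
    name="preprocess",
    image="example.com/taxi/preprocess:latest",  # hypothetical image
    command=["python", "preprocess.py"],
    inputs=[{"name": "raw_data", "type": "CSV"}],
    outputs=[{"name": "transformed_data", "type": "TFRecord"}],
)
```

Because the descriptor carries the image and the I/O schema together, anyone can load it and wire the component into their own pipeline without seeing the code.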
  38. Watson AI Operations: Kubeflow Pipelines.
  39. Example: a Watson Speech to Text operations Kubeflow pipeline.
  40. Example: a Watson Machine Learning and Watson OpenScale pipeline.
  41. Hybrid Clouds, TFX and Kubeflow Pipelines: TFX
  42. What is TFX? TL;DR: TFX is a platform for deploying TensorFlow models in production. TFX pipelines consist of a set of integrated components and are configured using Python. TFX consists of components, executors, and libraries. TFX components are optional (and repeatable), and TFX can be configured to run in many different ways.
  43. TFX has existed externally as open source libraries. Open sourced TFX libraries (circa 2018): TensorFlow Data Validation, TensorFlow Transform, TensorFlow Model Analysis, TensorFlow Serving.
  44. In 2019, the horizontal layers that integrate the TFX libraries into one platform were open sourced. Open sourced TFX platform (2019): data ingestion; TensorFlow Data Validation; TensorFlow Transform; Estimator or Keras model; TensorFlow Model Analysis; TensorFlow Serving; logging; shared utilities for garbage collection and data access controls; pipeline storage; a shared configuration framework and job orchestration; and an integrated frontend for job management, monitoring, debugging, and data/model/evaluation visualization.
  45. Anatomy of a component. TFX components consist of three main pieces: a driver, an executor, and a publisher.
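The division of labor between those three pieces can be mimicked in a few lines. This is a toy, not the tfx package's API: the driver resolves declared inputs from metadata, the executor does the work, and the publisher records outputs back into metadata. The dict-based "store" and all names are invented for the illustration.

```python
# Toy driver / executor / publisher split of a TFX component.

metadata = {"artifacts": {"examples": "gs://bucket/examples"}}  # toy store

def driver(component_inputs):
    # Resolve declared inputs against the metadata store.
    return {name: metadata["artifacts"][name] for name in component_inputs}

def executor(resolved_inputs):
    # Do the actual work (here: pretend to train and emit a model URI).
    return {"model": resolved_inputs["examples"] + "/model"}

def publisher(outputs):
    # Record outputs so downstream components can resolve them.
    metadata["artifacts"].update(outputs)
    return outputs

published = publisher(executor(driver(["examples"])))
```

The split is what makes executors swappable: you can reuse the driver/publisher machinery and replace only the work in the middle.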
  46. Anatomy of a component. TFX includes both libraries and pipeline components; this diagram illustrates the relationships between them. TFX provides several Python packages: the libraries used to create pipeline components.
  47. TFX (inside the box): a TFX pipeline runs ExampleGen, StatisticsGen, SchemaGen, ExampleValidator, Transform, Trainer, Evaluator, ModelValidator, and Pusher, driven by TFX Config and backed by the Metadata Store. It consumes training and eval data and pushes models to TensorFlow Serving, TensorFlow Hub, TensorFlow Lite, TensorFlow JS, or other runtimes.
  48. TFX uses ml-metadata for artifact management, moving from task-aware pipelines (input data, Transform, Trainer, trained models, serving system) to task- and data-aware pipelines backed by pipeline plus metadata storage.
  49. Snapshot of a component: the Trainer reads its config and produces a new (candidate) model; the Model Validator compares it against the last validated model and records a validation outcome; the Pusher picks up the candidate model and the validation outcome from the Metadata Store.
  50. What’s in the Metadata Store? Type definitions of artifacts and their properties (e.g. models, data, evaluation metrics); execution records (runs) of components (e.g. runtime configuration, inputs and outputs); and lineage tracking across all executions (e.g. to recurse back to all inputs of a specific artifact).
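The three kinds of records above are enough to recover lineage. The toy store below is an illustration only; the real ml-metadata library has its own schema and API, and every name here is invented.

```python
# Toy metadata store: artifact records, execution records, and lineage
# recovered by walking execution inputs/outputs backwards.

class MetadataStore:
    def __init__(self):
        self.artifacts = {}    # artifact id -> {"type": ...}
        self.executions = []   # {"component", "inputs", "outputs"}

    def record(self, component, inputs, outputs):
        for art_id, art_type in outputs.items():
            self.artifacts[art_id] = {"type": art_type}
        self.executions.append(
            {"component": component, "inputs": list(inputs),
             "outputs": list(outputs)})

    def lineage(self, artifact_id):
        """Recurse back to all upstream inputs of an artifact."""
        upstream = set()
        for ex in self.executions:
            if artifact_id in ex["outputs"]:
                for inp in ex["inputs"]:
                    upstream.add(inp)
                    upstream |= self.lineage(inp)
        return upstream

store = MetadataStore()
store.record("Transform", inputs=["raw_data"], outputs={"transformed": "Examples"})
store.record("Trainer", inputs=["transformed"], outputs={"model_1": "Model"})
lineage = store.lineage("model_1")
# lineage == {"transformed", "raw_data"}: the data model_1 was trained on.
```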
  51. Examples of metadata-powered functionality: find out which data a model was trained on, compare previous model runs, carry over state from previous models, and re-use previously computed outputs.
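"Re-use previously computed outputs" is essentially caching keyed on a component's name and inputs. The sketch below shows the idea with a content-hash key; the key scheme and function names are assumptions for the example, not how TFX computes its cache keys.

```python
# Metadata-powered output re-use: skip a component whose exact inputs
# were already processed, returning the recorded output instead.
import hashlib
import json

cache = {}
calls = 0

def run_component(name, inputs, fn):
    global calls
    key = hashlib.sha256(
        json.dumps([name, inputs], sort_keys=True).encode()).hexdigest()
    if key not in cache:
        calls += 1          # only count real executions
        cache[key] = fn(inputs)
    return cache[key]

out1 = run_component("StatisticsGen", {"data": "train.csv"},
                     lambda i: "stats-for-" + i["data"])
out2 = run_component("StatisticsGen", {"data": "train.csv"},
                     lambda i: "stats-for-" + i["data"])
# out1 == out2, and the executor ran only once (calls == 1).
```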
  52. Hybrid Clouds, TFX and Kubeflow Pipelines: demo. https://github.com/kubeflow/kfp-tekton/tree/master/samples/kfp-tfx
  53. Center for Open-Source Data & AI Technologies (CODAIT) / June 28, 2019 / © 2019 IBM Corporation. Model goal: will the customer tip more or less than 20%?
  54. TFX taxi pipeline: Kubeflow Pipelines running on the private cloud with an on-prem PVC, triggering model deployment to the public cloud with Istio + Kiali and Kubeflow serving, with object storage shared between them.
  55. Hybrid Clouds, TFX and Kubeflow Pipelines: Lessons Learnt
  56. Recommendations for TFX and KFP: TFX pipelines should be executable without any dependency on a public cloud service, e.g. GCS. Apache Beam is a strong dependency in TFX and doesn’t support S3 natively. The TFX DSL should support dynamically creating Persistent Volume Claims, and should support mixing and matching KFP ContainerOps components with TFX ones. IBM IKS runs Kubernetes with containerd while OpenShift uses CRI-O; the underlying pipeline platforms (Argo, Airflow, Beam, etc.) should support both as first-class citizens. Visualizing artifacts in the Kubeflow Pipelines UI should not be limited to GCS by default. Don’t assume root privileges on OpenShift and Kubernetes, or on the underlying storage file system.
  57. TFX and KFP DSL. Demo code: https://github.com/kubeflow/kfp-tekton/tree/master/samples/kfp-tfx. RFC for the TFX and KFP DSL merge: https://docs.google.com/document/d/1_n3q0mNOr7gUSM04yaA0e5BO9RrS0Vkh1cNCyrB07WM/edit#. Please reach out at @AnimeshSingh for any follow-on discussions.
  58. Thank you! Session by:
