Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Machine Learning using Kubeflow and Kubernetes

1,688 views

Published on

Machine Learning using Kubeflow and Kubernetes

Published in: Technology
  • Login to see the comments

Machine Learning using Kubeflow and Kubernetes

  1. 1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Arun Gupta, @arungupta Machine Learning using Kubeflow and Kubernetes
  2. 2. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://dilbert.com/strip/2013-02-02
  3. 3. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Supervised/unsupervised/reinforcement learning … Data sourcing, cleanup, tagging & classification Linear/logistic regression, Random forest, Decision tree, … Linear algebra, Statistics, Probability TensorFlow, PyTorch, MXNet, Caffe2, Keras, SciKit-Learn, … Python, Julia, R, … Training and evaluating models Distributed training IntelliJ, VSCode, PyCharm, Jupyter notebook Hyperparameter Tuning GPU or CPU MLOps https://happykaty.com/2018/05/15/drinking-from-a-fire-hose/ Machine Learning is Hard!
  4. 4. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning 101
  5. 5. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. FRAMEWORKS INTERFACES INFRASTRUCTURE AI Services Broadest and deepest set of capabilities T H E AW S M L S TA C K VISION SPEECH LANGUAGE CHATBOTS FORECASTING RECOMMENDATIONS ML Services ML Frameworks + Infrastructure P O L L Y T R A N S C R I B E T R A N S L A T E C O M P R E H E N D & C O M P R E H E N D M E D I C A L L E X F O R E C A S TR E K O G N I T I O N I M A G E R E K O G N I T I O N V I D E O T E X T R A C T P E R S O N A L I Z E Ground Truth Notebooks Algorithms + Marketplace Reinforcement Learning Training Optimization Deployment HostingAmazon SageMaker F P G A SE C 2 P 3 & P 3 D N E C 2 G 4 E C 2 C 5 I N F E R E N T I AG R E E N G R A S S E L A S T I C I N F E R E N C E D L C O N T A I N E R S & A M I s E L A S T I C K U B E R N E T E S S E R V I C E E L A S T I C C O N T A I N E R S E R V I C E
  6. 6. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. FRAMEWORKS INTERFACES INFRASTRUCTURE AI Services Broadest and deepest set of capabilities T H E AW S M L S TA C K VISION SPEECH LANGUAGE CHATBOTS FORECASTING RECOMMENDATIONS ML Services ML Frameworks + Infrastructure P O L L Y T R A N S C R I B E T R A N S L A T E C O M P R E H E N D & C O M P R E H E N D M E D I C A L L E X F O R E C A S TR E K O G N I T I O N I M A G E R E K O G N I T I O N V I D E O T E X T R A C T P E R S O N A L I Z E Ground Truth Notebooks Algorithms + Marketplace Reinforcement Learning Training Optimization Deployment HostingAmazon SageMaker F P G A SE C 2 P 3 & P 3 D N E C 2 G 4 E C 2 C 5 I N F E R E N T I AG R E E N G R A S S E L A S T I C I N F E R E N C E D L C O N T A I N E R S & A M I s E L A S T I C K U B E R N E T E S S E R V I C E E L A S T I C C O N T A I N E R S E R V I C E
  7. 7. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. FRAMEWORKS INTERFACES INFRASTRUCTURE AI Services Broadest and deepest set of capabilities T H E AW S M L S TA C K VISION SPEECH LANGUAGE CHATBOTS FORECASTING RECOMMENDATIONS ML Services ML Frameworks + Infrastructure P O L L Y T R A N S C R I B E T R A N S L A T E C O M P R E H E N D & C O M P R E H E N D M E D I C A L L E X F O R E C A S TR E K O G N I T I O N I M A G E R E K O G N I T I O N V I D E O T E X T R A C T P E R S O N A L I Z E Ground Truth Notebooks Algorithms + Marketplace Reinforcement Learning Training Optimization Deployment HostingAmazon SageMaker F P G A SE C 2 P 3 & P 3 D N E C 2 G 4 E C 2 C 5 I N F E R E N T I AG R E E N G R A S S E L A S T I C I N F E R E N C E D L C O N T A I N E R S & A M I s E L A S T I C K U B E R N E T E S S E R V I C E E L A S T I C C O N T A I N E R S E R V I C E
  8. 8. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. FRAMEWORKS INTERFACES INFRASTRUCTURE AI Services Broadest and deepest set of capabilities T H E AW S M L S TA C K VISION SPEECH LANGUAGE CHATBOTS FORECASTING RECOMMENDATIONS ML Services ML Frameworks + Infrastructure P O L L Y T R A N S C R I B E T R A N S L A T E C O M P R E H E N D & C O M P R E H E N D M E D I C A L L E X F O R E C A S TR E K O G N I T I O N I M A G E R E K O G N I T I O N V I D E O T E X T R A C T P E R S O N A L I Z E Ground Truth Notebooks Algorithms + Marketplace Reinforcement Learning Training Optimization Deployment Hosting F P G A SE C 2 P 3 & P 3 D N E C 2 G 4 E C 2 C 5 I N F E R E N T I AG R E E N G R A S S E L A S T I C I N F E R E N C E D L C O N T A I N E R S & A M I s E L A S T I C K U B E R N E T E S S E R V I C E E L A S T I C C O N T A I N E R S E R V I C E Amazon SageMaker
  9. 9. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. M A C H I N E L E A R N I N G S T O R A G E Amazon Redshift + Redshift Spectrum Amazon QuickSight Amazon EMR Hadoop, Spark, Presto, Pig, Hive…19 total Amazon Athena Amazon Kinesis Amazon Elasticsearch Service AWS Glue A N A L Y T I C S Amazon S3 Standard-IA Amazon S3 Standard Amazon S3 One Zone-IA Amazon Glacier Amazon S3 Intelligent- Tiering N E W Amazon EBS Amazon S3 Glacier Deep Archive N E W Storage and Analytics for Machine Learning
  10. 10. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Why Machine Learning on Kubernetes? Composability Portability Scalability O N - P R E M I S E S C L O U D http://www.shutterstock.com/gallery-635827p1.html
  11. 11. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning on K8s: Without KubeFlow @aronchik
  12. 12. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning on K8s: With KubeFlow @aronchik
  13. 13. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  14. 14. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What’s in KubeFlow?
  15. 15. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EKS: run Kubernetes in cloud Managed Kubernetes control plane, attach data plane Native upstream Kubernetes experience Platform for enterprises to run production-grade workloads Integrates with additional AWS services
  16. 16. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Getting started with Amazon EKS eksctl CLI—create Amazon EKS clusters (eksctl.io) Creates all resources needed for the cluster
  17. 17. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Train Inference Set up K8s for ML: Option 1 Trained model 2 3 4 Data 1
  18. 18. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Set up K8s for ML: Option 2a Train & inference Trained model 2 3 4 role: train role: train role: train role: inference role: inference Data 1
  19. 19. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Set up K8s for ML: Option 2b Train, inference, & applications role: train role: train role: train role: inference role: inference role: apps role: apps
  20. 20. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Scaling the cluster
  21. 21. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kubeflow Requirements 4 CPU, 12 GB memory, 50 GB storage kubeflow.org/docs/started/k8s/overview/
  22. 22. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kubeflow on Desktop MiniKF: Local Kubeflow deployment using VirtualBox and Vagrant • Minikube -> Kubernetes • MiniKF -> Kubeflow (includes minikube) Runs on macOS, Linux, and Windows Does not require k8s-specific knowledge
  23. 23. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kubeflow on Cloud Major cloud providers supported Choices on Amazon Web Services • Self-managed k8s on EC2: Kops, CloudFormation, Terraform • Amazon EKS
  24. 24. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Getting Started with Kubeflow on Amazon EKS
  25. 25. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jupyter Notebook Web application to build, deploy, and train ML models Create and share documents that contain live code, equations, visualizations, and narrative text 40+ programming languages Use cases data cleaning and transformation numerical simulation data visualization machine learning
  26. 26. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Training using Jupyter Notebook
  27. 27. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kubeflow Fairing Python SDK to build, train, and deploy ML models remotely Goals: • Easily package ML training jobs • Train ML models in the cloud • Streamline the process of deploying a trained model
  28. 28. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/02_Fairing/02_06_fairing_e2e.ipynb Setup Kubeflow Fairing for training and prediction
  29. 29. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Train an XGBoost model remotely on Kubeflow Deploy the trained model to Kubeflow for prediction
  30. 30. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Call the prediction endpoint
  31. 31. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Hyperparameter Tuning using Katib Hyperparameter are parameters external to the model to control the training e.g. Learning rate, batch size, epochs Tuning finds a set of hyperparameters that optimizes an objective function e.g. Find the optimal batch size and learning rate to maximize prediction accuracy Katib enables hyperparameter tuning in Kubeflow Credits @Richard Liu @Johnu (Kubeflow slack)
  32. 32. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Katib Concepts Extensible Framework agnostic: TensorFlow, PyTorch, MXNet, … Customizable algorithm backend Experiment: “optimization loop” for some specific problem Suggestion: a proposed solution to the problem Trial: one iteration of the loop Job: evaluate a trial and calculate objective value
  33. 33. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Katib System Architecture Experiment Controller Hyperparameters Trial Controller Suggestion Controller Trial Trial Trial Create Experiment Metrics Trial CR Experiment = CreateExperimentCR() while not Experiment.Objective reached: Suggestion = CreateSuggestionCR() HyperParameters = Suggestion.Assignments Metrics = CreateTrialCR(HyperParameters) ReportMetrics(Metrics) Suggestion CR Metric DB Experiment CR
  34. 34. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Experiment CR Hyperparameters https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/08_Hyperparameter_Tuning/random-search-example.yaml
  35. 35. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Trial template
  36. 36. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  37. 37. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kubeflow KFServing Simple and pluggable platform for ML inference Intuitive and consistent experience Serving models on arbitrary frameworks e.g. TensorFlow, XGBoost, SciKitLearn Encapsulates GPU auto-scaling, canary rollouts Credits @ellis-bigelow (Kubeflow slack)
  38. 38. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  39. 39. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. KFServing Custom Resource S3 secret attached to Service Account Trained model https://github.com/kubeflow/kfserving/blob/master/docs/samples/s3/tensorflow_s3.yaml
  40. 40. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Pluggable Interface apiVersion: "serving.kubeflow.org/v1alpha1" kind: "InferenceService" metadata: name: "sklearn-iris" spec: default: sklearn: storageUri: "gs://kfserving-samples/models/sklearn/iris" apiVersion: "serving.kubeflow.org/v1alpha1" kind: "InferenceService" metadata: name: "flowers-sample" spec: default: tensorflow: storageUri: "gs://kfserving-samples/models/tensorflow/flowers" apiVersion: "serving.kubeflow.org/v1alpha1" kind: "KFService" metadata: name: "pytorch-cifar10" spec: default: pytorch: storageUri: "gs://kfserving-samples/models/pytorch/cifar10" modelClassName: "Net"
  41. 41. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. KFServing Interface – Scikit Learn apiVersion: "serving.kubeflow.org/v1alpha1" kind: "KFService" metadata: name: "sklearn-iris" spec: default: sklearn: storageUri: "gs://kfserving-samples/models/sklearn/iris" serviceAccount: inferencing-robot minReplicas: 3 maxReplicas: 10 resources: requests: cpu: 2 gpu: 1 memory: 10Gi canaryTrafficPercent: 25 canary: sklearn: storageUri: "gs://kfserving-samples/models/sklearn/iris-v2" serviceAccount: inferencing-robot minReplicas: 3 maxReplicas: 10 resources: requests: cpu: 2 gpu: 1 memory: 10Gi
  42. 42. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Distributed Training using Horovod Created by Uber, hosted by LF AI Foundation Distributed training framework for TensorFlow, Keras, PyTorch, and MXNet Compared to distributed TensorFlow • Far less code changes • ~2x faster Examples at Uber: Self-driving vehicles, fraud detection, and trip forecasting Named after a traditional Russian dance
  43. 43. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kubeflow Pipelines Compose, deploy, and manage end-to-end ML workflows • End-to-end orchestration • Easy, rapid, and reliable experimentation • Easy re-use Built using Pipelines SDK • kfp.compiler, kfp.components, kfp.Client Uses Argo under the hood to orchestrate resources
  44. 44. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kubeflow Pipelines Platform UI for managing and tracking experiments, jobs, and runs Engine for scheduling multi-step ML workflows SDK for defining and manipulating pipelines and components • kubeflow-pipelines.readthedocs.io/en/latest/ Notebooks for interacting with the system using the SDK
  45. 45. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Creating Kubeflow Pipeline Components
  46. 46. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Consumer Loan Acceptance Scoring Objective • Putting hundreds of data products live • Single development -> deployment -> delivery environment • First go batch, then real-time Analytics environment • AWS • High security and compliance with regulation Typical modeling context • Structured data • Supervised learning • Internalizing interpretable models and hybrid pipelines www.credo.be
  47. 47. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Requirements Data Scientists • Hybrid, integrated, cloud-based dev env • Python • PySpark (locally + remotely on Spark cluster) • R • SQL • Version control (scripts & artifacts) ML DevOps • Seamless deployment of hybrid pipelines • Trigger-based scheduling & orchestration of runs • Monitoring & dashboard • Version control (runs & pipelines) www.credo.be
  48. 48. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Architecture AWS Infrastructure, connections, security S3, Spark cluster, VMs, … Amazon EKS Amazon ECR Kubeflow 0.6 Notebook dev environment Pipelines for dev & delivery ElasticStack Dashboarding Custom notebook servers www.credo.be
  49. 49. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Storage: FSx for Lustre High performance file system for processing Amazon S3 or on-premises data Low latency and high throughput Works natively with Amazon S3 Container Storage Interface (CSI) driver
  50. 50. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Best Practices for Optimizing Distributed Deep Learning Performance on Amazon EKS https://aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
  51. 51. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Advantages of KubeFlow on AWS EKS cluster provision with External traffic with to manage Lustre file system Centralized and unified K8s logs in TLS and Auth with and for your K8s API server endpoint Detect GPU instance and install kubeflow.org/docs/aws
  52. 52. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  53. 53. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kanban Board github.com/orgs/kubeflow/projects/25
  54. 54. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Application Requirements 1.0 https://github.com/kubeflow/community/blob/master/g uidelines/application_requirements.md
  55. 55. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning pipeline Choose and Optimize your ML algorithm Setup and manage environments for training Deploy model in production Collect & prepare training data Train and tune model (trial and error) Scale & manage environment in production
  56. 56. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning pipeline for Kubernetes on AWS Linear regression, decision tree, BYOA GPU- and CPU- based clusters, *operators (TensorFlow, MXNet, …) TensorFlow Serving, MXNet Model Server, Seldon, … EMR, Redshift, S3 TensorFlow, Horovod, MXNet, PyTorch, Keras, … EKS or Self-managed K8s Choose and Optimize your ML algorithm Setup and manage environments for training Deploy model in production Collect & prepare training data Train and tune model (trial and error) Scale & manage environment in production
  57. 57. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. References Workshop: eksworkshop.com/kubeflow Jupyter notebooks: github.com/aws-samples/eks- kubeflow-workshop/ Optimizing Machine Learning performance: aws.amazon.com/blogs/opensource/optimizing- distributed-deep-learning-performance-amazon-eks/

×