
Machine learning on kubernetes


Talk on ML on K8s and associated components.

Published in: Technology

  1. 1. Machine Learning on Kubernetes 13 Dec 2017 Anirudh Ramanathan Software Engineer on Kubernetes Twitter: @anirudh4444
  2. 2. Disclaimer I’m not a Machine Learning expert. I work on infrastructure and distributed systems for a living.
  3. 3. Kubernetes a year ago... ● Was used primarily for stateless workloads ● Needed an understanding of several core concepts to operate ● Applications had to be written to fit into core controller abstractions
  4. 4. Kubernetes today... ● Has abstractions to support Stateful applications and now data processing and machine learning. ● Has a wide range of extension points including ones that allow API extensions and custom controllers. ● Has support for building higher level abstractions and APIs to hide infrastructure & operational complexity.
  5. 5. What’s changed? ● Workload controller abstractions moving to GA/stable. ● Custom Resource Definitions & Aggregated API Servers ● Kubernetes Operators ● Community support for external frameworks ● Work on scheduling and resource management (ongoing)
  6. 6. Machine Learning Solving problems without explicitly knowing how to create solutions
  7. 7. Machine Learning Infrastructure TFX: A TensorFlow-Based Production-Scale Machine Learning Platform (KDD 2017)
  8. 8. Machine Learning Infrastructure TFX: A TensorFlow-Based Production-Scale Machine Learning Platform (KDD 2017)
  9. 9. Kubeflow Our goal is not to recreate other services, but to provide a straightforward way to spin up best-of-breed OSS solutions. ● A JupyterHub to create & manage interactive Jupyter notebooks ● A TensorFlow Training Controller that can be configured to use CPUs or GPUs, and adjusted to the size of a cluster with a single setting ● A TF Serving container
  10. 10. JupyterHub ● A single hub & proxy for managing interactive sessions ● Can run entirely within Kubernetes - notebooks are backed by Kubernetes pods ● Can request required resources - CPUs, GPUs, etc. ● Has pluggable authentication (oauth, kdc, etc.)
  11. 11. TensorFlow Training Controller ● A Kubernetes “operator” to help run distributed/non-distributed TF training. ● Exposes an API through a CustomResourceDefinition ● The controller manages the complexity of distributed training with TensorFlow.
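
As a rough illustration of what "exposes an API through a CustomResourceDefinition" means in practice, the sketch below builds a custom resource object for a training job as plain Python data. It is not the controller's actual schema: the API group `example.org/v1alpha1`, the field names `replicaSpecs` and `tfReplicaType`, and the image name are all assumptions made for illustration.

```python
import json


def make_training_job(name, image, workers=2, parameter_servers=1):
    """Build an illustrative custom resource for a distributed training job.

    All field names below are hypothetical; a real controller defines its
    own schema through its CustomResourceDefinition.
    """
    def replica(kind, count):
        # One group of identical pods (e.g. all workers) in the job.
        return {
            "tfReplicaType": kind,  # hypothetical field name
            "replicas": count,
            "template": {
                "spec": {
                    "containers": [{"name": "tensorflow", "image": image}],
                    "restartPolicy": "OnFailure",
                }
            },
        }

    return {
        "apiVersion": "example.org/v1alpha1",  # hypothetical API group/version
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "replicaSpecs": [  # hypothetical field name
                replica("MASTER", 1),
                replica("WORKER", workers),
                replica("PS", parameter_servers),
            ]
        },
    }


if __name__ == "__main__":
    # "example/tf-mnist:latest" is a placeholder image name.
    print(json.dumps(make_training_job("mnist", "example/tf-mnist:latest"), indent=2))
```

The point of the pattern is that the user declares only the job's shape (how many workers, which image), and the controller is responsible for turning that declaration into pods and wiring them together.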
  12. 12. TensorFlow Serving ● A Kubernetes Deployment that can serve saved models ● Because it is a Deployment, replicas can be scaled. Future work: ● Custom metrics & Autoscaling
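
A minimal sketch of the "Deployment whose replicas can be scaled" idea, written as plain Python data rather than YAML. The manifest uses the standard `apps/v1` Deployment fields; the image name, the port `9000`, and the helper names `serving_deployment` and `scaled` are assumptions for illustration, not taken from the talk.

```python
import copy
import json


def serving_deployment(name, image, replicas=1, port=9000):
    """Build a Kubernetes apps/v1 Deployment manifest for a model server.

    The port default is an assumption; use whatever the serving image exposes.
    """
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template's labels.
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


def scaled(manifest, replicas):
    """Return a copy of the manifest with a different replica count."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    out = copy.deepcopy(manifest)
    out["spec"]["replicas"] = replicas
    return out


if __name__ == "__main__":
    # "example/model-server:latest" is a placeholder image name.
    base = serving_deployment("model-server", "example/model-server:latest")
    print(json.dumps(scaled(base, 3), indent=2))
```

Scaling is just a change to `spec.replicas`; the Deployment controller then adds or removes identical serving pods to match.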
  13. 13. But there were so many stages! ● Clearly there are many other challenges faced by people building Machine Learning infrastructure. ● How do I preprocess data? ● How do I describe my pipeline? ● How do I orchestrate my pipeline? ● We have some ideas.
  14. 14. Apache Spark ● Spark on Kubernetes has been an ongoing effort since Dec 2016. ● It is being upstreamed into Spark and is expected to land in Spark 2.3 (due sometime in January). ● The changes make Spark itself aware of a new Kubernetes Scheduler that can directly run Spark applications for the user.
  15. 15. Apache Spark [Diagram] Components: Spark Core, Kubernetes Scheduler Backend, Kubernetes Cluster. Interactions shown: add executors, rm executors, configuration.
  16. 16. Apache Spark Kubernetes Scheduler for Spark ● Spark 2.3 will support ○ Running Java/Scala jobs ○ Static allocation of executors ○ Some dependency management ● Our fork has several additional features which we’re slowly upstreaming. ○ It’s being run by several organizations right now.
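
To make "static allocation of executors" on the Kubernetes scheduler concrete, here is a small sketch that assembles a `spark-submit` command line. The `k8s://` master URL form and the `spark.executor.instances` and `spark.kubernetes.container.image` settings follow the upstream Spark documentation for Kubernetes; the fork mentioned above may use different property names. The API server address, image, class name, and jar path are placeholders, and `spark_submit_args` is a helper invented for this example.

```python
import shlex


def spark_submit_args(api_server, image, main_class, app_jar,
                      executors=2, name="spark-job"):
    """Assemble spark-submit arguments for a cluster-mode run on Kubernetes."""
    if executors < 1:
        raise ValueError("need at least one executor")
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # Kubernetes API server as the master
        "--deploy-mode", "cluster",         # the driver also runs in the cluster
        "--name", name,
        "--class", main_class,
        # Static allocation: a fixed number of executor pods for the job.
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


if __name__ == "__main__":
    args = spark_submit_args(
        api_server="https://kubernetes.example.com:6443",  # placeholder
        image="example/spark:latest",                      # placeholder
        main_class="org.example.MyApp",                    # placeholder
        app_jar="local:///opt/app/my-app.jar",             # placeholder
        executors=5,
    )
    # Print a shell-safe command line instead of executing it.
    print(" ".join(shlex.quote(a) for a in args))
```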
  17. 17. Apache Airflow ● A DAG scheduler. ● Has a rich ecosystem of “operators” to allow interacting with different applications. ● Community working on a Kubernetes native executor for Airflow. ● Currently in the process of being upstreamed.
  18. 18. Apache Airflow The operators can specify various Kubernetes executor constraints within each DAG step. For example: BashOperator(task_id='account-test', bash_command='', dag=dag, executor_config={'request_memory': '128Mi', 'limit_memory': '128Mi', 'image': 'airflow/scipy:1.1.5'})
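
The "DAG scheduler" idea can be sketched without Airflow itself: tasks run only after everything upstream of them has finished, and tasks with no dependency between them can run in the same wave. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) and is not how Airflow is implemented; the task names and the `run_in_waves` helper are made up for illustration.

```python
from graphlib import TopologicalSorter


def run_in_waves(dag):
    """Group tasks into waves that could execute in parallel.

    `dag` maps each task name to the set of tasks it depends on.
    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Every task whose upstream tasks are all done is ready now.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    # Hypothetical ML pipeline: two preprocessing steps feed one training step.
    pipeline = {
        "clean": {"extract"},
        "featurize": {"extract"},
        "train": {"clean", "featurize"},
        "serve": {"train"},
    }
    for i, wave in enumerate(run_in_waves(pipeline), start=1):
        print(f"wave {i}: {', '.join(wave)}")
```

With a Kubernetes-native executor, each task in a wave could be launched as its own pod, which is where per-task settings such as memory requests and images come in.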
  19. 19. Putting it all together [Diagram] Components: HDFS or GCS/S3, Spark, Airflow Pipeline, JupyterHub, TensorFlow, Other ML Frameworks.
  20. 20. Get Involved Kubeflow ● Slack Channel (See for joining instructions) ● Twitter ( ● Mailing List (!forum/kubeflow-discuss) SIG Big Data ● Slack Channel ( ● Mailing list (!forum/kubernetes-sig-big-data) ● Weekly meeting (
  21. 21. Questions?