
Scaling Interactive Workloads across Kubernetes Cluster

The Jupyter Notebook Stack has become the "de facto" platform used by data scientists to interactively work on big data problems. With the popularity of deep learning, there is also an increasing need for resources to make deep learning effective. In this session, we will discuss how we brought support for Kubernetes into Jupyter Enterprise Gateway and touch on some best practices for scaling interactive big data workloads across a Kubernetes-managed cluster.

Published in: Data & Analytics

  1. 1. Scaling Interactive Workloads across Kubernetes Cluster • Luciano Resende • Codemotion Milan - 2018 • © 2018 IBM Corporation
  2. 2. About me - Luciano Resende • Open Source AI Platform Architect – IBM – CODAIT • Senior Technical Staff Member at IBM, contributing to open source for over 10 years • Currently contributing to: Jupyter Notebook ecosystem, Apache Bahir, Apache Toree, Apache Spark, among other projects related to AI/ML platforms • lresende@us.ibm.com • https://www.linkedin.com/in/lresende • @lresende1975 • https://github.com/lresende
  3. 3. IBM Open Source Participation • Learn: Open Source @ IBM Program touches 78,000 IBMers annually • Consume: Virtually all IBM products contain some open source – 40,363 pkgs per year • Contribute: >62K OS Certs per year – ~10K IBM commits per month • Connect: >1000 active IBM Contributors working in key OS projects
  4. 4. IBM Open Source Participation • IBM generated open source innovation – 137 Code Open (dWO) projects w/1000+ GitHub projects – 4 graduates: Node-Red, OpenWhisk, SystemML, Blockchain fabric to full open governance in the last year – developer.ibm.com/code/open/code/ • Community – IBM focused on 18 strategic communities – Drive open governance in “Centers of Gravity” – IBM Leaders drive key technologies and assure freedom of action • The IBM OS Way is now open sourced – Training, Recognition, Tooling – Organization, Consuming, Contributing
  5. 5. Center for Open Source Data and AI Technologies (CODAIT) • codait.org • codait (French) = coder/coded https://m.interglot.com/fr/en/codait • CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise • Relaunch of the Spark Technology Center (STC) to reflect expanded mission
  6. 6. Interactive Development with Jupyter Notebooks
  7. 7. Jupyter Notebooks • Notebooks are interactive computational environments in which you can combine code execution, rich text, mathematics, plots and rich media.
  8. 8. Jupyter Notebooks • Notebook UI runs in the browser • The Notebook Server serves the 'Notebooks' • Kernels interpret/execute cell contents – Are responsible for code execution – Abstract different languages – Have a 1:1 relationship with a Notebook – Run and consume resources as long as the notebook is running
  9. 9. Analytics and Deep Learning Workloads
  10. 10. Analytics Workloads • Large amount of data • Shared across the organization in Data Lakes • Multiple workload types – Data cleansing – Data Warehouse – ML and Insights
  11. 11. Deep Learning Workloads • Resource-intensive workloads • Require expensive hardware (GPU, TPU) • Long-running training jobs – Simple MNIST takes over one hour WITHOUT a decent GPU – Training other non-complex deep learning models can easily take over a day WITH GPUs
  12. 12. Local Development Environment
  13. 13. Analytic and AI Platforms • Large pool of shared computing resources – Enterprise Cloud, Public Cloud or Hybrid – Shared Data (Data Lakes/Object Storage) • Distributed Consumers – Notebooks running locally (user's laptop) or as a service (e.g. JupyterHub) • Different Resource Utilization Patterns – High number of idle resources
  14. 14. Limitations of Jupyter Notebook Stack • Scalability – Jupyter Kernels run as local processes – Resources are limited by what is available on the single node that runs all Kernels and associated Spark drivers • Security – Single user sharing the same privileges – Users can see and control each other's processes using Jupyter administrative utilities • [Diagram: data science pipeline (Gather Data, Analyze Data, Machine Learning, Deep Learning, Deploy Model, Maintain Model) with Python Data Science Stack, Pandas, Scikit-Learn, Apache Spark, Jupyter, Keras + TensorFlow, Fabric for Deep Learning (FfDL), MLeap + PFA, Model Asset eXchange] • [Chart: Maximum Number of Simultaneous Kernels; x-axis: Cluster Size (32GB Nodes); y-axis: Max Kernels (4GB Heap); values: 4 Nodes = 8, 8 Nodes = 8, 12 Nodes = 8, 16 Nodes = 8]
  15. 15. Jupyter Enterprise Gateway
  16. 16. Jupyter Enterprise Gateway • A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across an Apache Spark or Kubernetes cluster for Enterprise/Cloud use cases • Jupyter Enterprise Gateway at IBM Code: https://developer.ibm.com/code/openprojects/jupyter-enterprise-gateway/ • Jupyter Enterprise Gateway source code at GitHub: https://github.com/jupyter-incubator/enterprise_gateway • Jupyter Enterprise Gateway Documentation: http://jupyter-enterprise-gateway.readthedocs.io/en/latest/ • [Logos: Supported Kernels; Supported Platforms, including Spectrum Conductor]
  17. 17. Jupyter Enterprise Gateway Features • Optimized Resource Allocation – Utilize resources on all cluster nodes by running kernels as Spark applications in YARN Cluster Mode – Pluggable architecture to enable support for additional Resource Managers • Enhanced Security – End-to-end secure communications: secure socket communications, encrypted HTTP communication using SSL • Multiuser support with user impersonation – Enhance security and sandboxing by enabling user impersonation when running kernels (using Kerberos) – Individual HDFS home folder for each notebook user – Use the same user ID for notebook and batch jobs • [Diagram: same data science pipeline as on slide 14] • [Chart: Maximum Number of Simultaneous Kernels; x-axis: Cluster Size (32GB Nodes); y-axis: Max Kernels (4GB Heap); values: 4 Nodes = 16, 8 Nodes = 32, 12 Nodes = 48, 16 Nodes = 64]
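
The two "Maximum Number of Simultaneous Kernels" charts (slides 14 and 17) can be read as simple capacity arithmetic. The sketch below only encodes the values shown on the slides and derives the ratios; the variable names are illustrative, and the figure of 4 kernels per node in gateway mode is inferred from the chart numbers, not stated in the talk.

```python
# Chart values copied from slides 14 and 17 (32GB nodes, 4GB heap per kernel).
CLUSTER_SIZES = [4, 8, 12, 16]          # number of nodes
MAX_KERNELS_LOCAL = [8, 8, 8, 8]        # kernels as local processes on one node
MAX_KERNELS_GATEWAY = [16, 32, 48, 64]  # kernels spread across the cluster

NODE_MEMORY_GB = 32
KERNEL_HEAP_GB = 4

# Local mode: only the memory of the single notebook node counts,
# so the ceiling is the same regardless of cluster size.
local_capacity = NODE_MEMORY_GB // KERNEL_HEAP_GB

# Gateway mode: capacity grows linearly with the number of nodes.
# The per-node figure is derived from the chart, not from the slide text.
kernels_per_node = [k // n for k, n in zip(MAX_KERNELS_GATEWAY, CLUSTER_SIZES)]

for nodes, local, gateway in zip(CLUSTER_SIZES, MAX_KERNELS_LOCAL, MAX_KERNELS_GATEWAY):
    print(f"{nodes:>2} nodes: local={local:>2}  gateway={gateway:>2}  ({gateway // local}x)")
```

The local ceiling matches 32GB / 4GB = 8 kernels, while the gateway numbers scale with the node count, which is the point the two charts make side by side.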
  18. 18. Jupyter Notebooks and Kubernetes
  19. 19. Deep Learning Workloads • Resource-intensive workloads • Require expensive hardware (GPU, TPU) • Long-running training jobs – Simple MNIST takes over one hour WITHOUT a decent GPU – Training other non-complex deep learning models can easily take over a day WITH GPUs
  20. 20. Jupyter & Kubernetes • Kubernetes Platform – Containers provide a flexible way to deploy applications and are here to stay – Containers simplify management of complicated and heterogeneous AI/Deep Learning infrastructure – Kubernetes enables easy management of containerized applications and resources with the benefit of Elasticity and Quality of Service • Source: https://github.com/Langhalsdino/Kubernetes-GPU-Guide
  21. 21. Enterprise Gateway & Kubernetes • Before Jupyter Enterprise Gateway – Resources required for all kernels need to be allocated during Notebook Server pod creation – Resources are limited to what is physically available on the host node that runs all kernels and associated Spark drivers • After Jupyter Enterprise Gateway – Gateway pod is very lightweight – Kernels run in their own pods, which isolates them – Kernel pods are built from community images: Spark-on-K8s, TensorFlow, Keras, etc. • [Diagrams: Before Enterprise Gateway; After Enterprise Gateway; Supported Platforms, including FfDL]
  22. 22. Jupyter Enterprise Gateway - Kubernetes • Container images defined in kernelspec • [Diagram labels: nb2kg, Gateway, Vanilla Kernels, Spark based kernels, Community image Kernel, Spark on K8 Kernel, Distributed File System]
  23. 23. March 30 2018 / © 2018 IBM Corporation
  24. 24. Jupyter & Kubernetes • Multi-user Enterprise Gateway pod • Each kernel launched in its own pod • Kernel pod namespace is configurable
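
To make "one pod per kernel, in a configurable namespace" concrete, here is a minimal sketch that builds a plain Kubernetes Pod manifest for each kernel. The helper `kernel_pod_manifest`, the pod naming scheme, the `kernel_id` label and the `notebooks` namespace are all hypothetical and chosen for illustration; this is not how Enterprise Gateway itself generates pods. Only the image name comes from the kernelspec shown on slide 26.

```python
import json


def kernel_pod_manifest(kernel_id: str, namespace: str, image: str) -> dict:
    """Build a minimal Pod manifest for one kernel (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"kernel-{kernel_id}",       # hypothetical naming scheme
            "namespace": namespace,              # configurable per deployment
            "labels": {"kernel_id": kernel_id},  # hypothetical label
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [{"name": "kernel", "image": image}],
        },
    }


# Two kernels yield two independent pods in the same (assumed) namespace.
pods = [
    kernel_pod_manifest(kid, "notebooks", "elyra/kubernetes-kernel-py:dev")
    for kid in ("a1", "b2")
]
print(json.dumps(pods[0], indent=2))
```

Because each kernel gets its own manifest, resource limits and images can differ per kernel instead of being fixed when the notebook server pod is created.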
  25. 25. Enabling remote kernels • Jupyter Kernels are configured by kernelspecs • Each kernel has a corresponding kernelspec • Stored in one of the Jupyter data paths • $ jupyter kernelspec list • /…/anaconda3/share/jupyter/kernels/python2/kernel.json
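
A kernelspec is just a `kernel.json` file inside a kernel directory, so it can be inspected with the standard library alone. The sketch below writes a sample spec to a temporary directory and reads it back; the helper `load_kernelspec` is hypothetical and the sample contents are illustrative of a typical Python kernelspec, not copied from the slides.

```python
import json
import tempfile
from pathlib import Path


def load_kernelspec(kernel_dir: Path) -> dict:
    """Read the kernel.json stored in a kernel directory (hypothetical helper)."""
    with open(kernel_dir / "kernel.json", encoding="utf-8") as fh:
        return json.load(fh)


# Illustrative contents; a real spec lives under one of the Jupyter data paths.
SAMPLE_SPEC = {
    "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "Python 2",
    "language": "python",
}

with tempfile.TemporaryDirectory() as tmp:
    kernel_dir = Path(tmp) / "kernels" / "python2"
    kernel_dir.mkdir(parents=True)
    (kernel_dir / "kernel.json").write_text(json.dumps(SAMPLE_SPEC), encoding="utf-8")
    spec = load_kernelspec(kernel_dir)

print(spec["display_name"], spec["argv"])
```

The `{connection_file}` placeholder in `argv` is substituted by Jupyter at launch time; the same mechanism is what the remote-kernel kernelspec on the next slide builds on.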
  26. 26. Enabling remote kernels: Process Proxies mixed with Kernel Launchers • Process Proxy: – Abstracts kernel process represented by Jupyter framework – Pluggable class definition identified in kernelspec (kernel.json) – Manages kernel lifecycle • Kernel Launcher: – Embeds target kernel – Listens on gateway communication port – Conveys interrupt requests (via local signal) – Could be extended for additional communications • Example kernel.json: { "language": "python", "display_name": "Spark - Python (Kubernetes Mode)", "process_proxy": { "class_name": "enterprise_gateway.services.processproxies.k8s.KubernetesProcessProxy", "config": { "image_name": "elyra/kubernetes-kernel-py:dev", "executor_image_name": "elyra/kubernetes-kernel-py:dev", "port_range": "40000..42000" } }, "env": { "SPARK_HOME": "/opt/spark", "SPARK_OPTS": "--master k8s://https://${KUBERNETES_SERVICE_HOST --deploy-mode cluster --name …", … }, "argv": [ "/usr/local/share/jupyter/kernels/spark_python_kubernetes/bin/run.sh", "{connection_file}", "--RemoteProcessProxy.response-address", "{response_address}", "--RemoteProcessProxy.spark-context-initialization-mode", "lazy" ] }
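
The `process_proxy` stanza above is ordinary JSON, so its fields can be pulled apart with a few lines of Python. This is a sketch only: `parse_port_range` and `describe_process_proxy` are hypothetical helpers written for this example, not Enterprise Gateway functions, and the dictionary reproduces only the non-elided fields from the slide.

```python
from typing import Tuple


def parse_port_range(value: str) -> Tuple[int, int]:
    """Parse a 'low..high' port range string (hypothetical helper)."""
    low_text, high_text = value.split("..")
    low, high = int(low_text), int(high_text)
    if not 0 < low <= high <= 65535:
        raise ValueError(f"invalid port range: {value!r}")
    return low, high


def describe_process_proxy(spec: dict) -> dict:
    """Summarize the process_proxy stanza of a kernelspec (hypothetical helper)."""
    proxy = spec["process_proxy"]
    # Split the dotted path into module and class; nothing is imported here.
    module_name, _, class_name = proxy["class_name"].rpartition(".")
    return {
        "module": module_name,
        "class": class_name,
        "image": proxy["config"]["image_name"],
        "ports": parse_port_range(proxy["config"]["port_range"]),
    }


# Non-elided fields copied from the kernelspec on the slide.
kernelspec = {
    "language": "python",
    "display_name": "Spark - Python (Kubernetes Mode)",
    "process_proxy": {
        "class_name": "enterprise_gateway.services.processproxies.k8s.KubernetesProcessProxy",
        "config": {
            "image_name": "elyra/kubernetes-kernel-py:dev",
            "executor_image_name": "elyra/kubernetes-kernel-py:dev",
            "port_range": "40000..42000",
        },
    },
}

summary = describe_process_proxy(kernelspec)
print(summary)
```

Keeping the proxy class as a dotted string in `kernel.json` is what makes the design pluggable: a different resource manager only needs a different `class_name` and `config`.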
  27. 27. Jupyter Enterprise Gateway Components • [Architecture diagram labels: Jupyter Notebook (NB2KG Extension, Lab Extension); Jupyter Enterprise Gateway (Remote Kernel Manager, Distributed Process Proxy, YARN Cluster Process Proxy, Kubernetes Process Proxy, Conductor Cluster Process Proxy); Jupyter Kernel Gateway; Jupyter Notebook; Supported Runtime Platforms, including Spectrum Conductor and FfDL]
  28. 28. Jupyter Notebooks and Deep Learning Platforms
  29. 29. Deep Learning Platforms • Prohibitive costs – Deep Learning resources are too expensive to leave locked or idle during interactive development • Deep Learning Platforms – We have seen the rise of Deep Learning platforms that leverage containers and Kubernetes as the basis of their infrastructure – Kubernetes enables Deep Learning platforms to easily share and restrict accelerated hardware • Fabric for Deep Learning • IBM Watson Studio Deep Learning as a service • Batch-oriented development
  30. 30. Deep Learning Workspace • Streamline the Data Science user experience when coming from Notebook/Interactive development interfaces • The current process includes multiple steps, one of which is decomposing the notebook into an application that must be submitted as a zip to the deep learning runtime; this becomes a showstopper for data scientists adopting FfDL and DLaaS
  31. 31. Deep Learning Workspace • Streamline the Deep Learning application lifecycle – Run local notebook experiments with small data samples and seamlessly validate experiments on Deep Learning environments – IBM Cloud DLaaS, FfDL (open source), KubeFlow (open source) • Simplify productionization of Model training and serving from Notebooks – Enable running/scheduling notebooks on production environments as batch jobs – Results can be made available via an updated notebook, or exported to HTML, PDF and a few other formats • Interactive development lifecycle done on commodity hardware with sampled data • Training on the full dataset gets scheduled as batch jobs on deep learning infrastructure
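
Running a notebook as a batch job and exporting the result can be approximated with the stock `jupyter nbconvert` command line. The slides do not say which tool the Deep Learning Workspace uses, so nbconvert is swapped in here purely as an illustration; the helper `nbconvert_command`, the notebook name `train.ipynb` and the `results` directory are hypothetical.

```python
import shlex
from typing import List


def nbconvert_command(notebook: str, output_format: str = "html",
                      output_dir: str = "results") -> List[str]:
    """Build an argv that executes a notebook and exports the result.

    Hypothetical helper; it only assembles the command and does not run it.
    """
    return [
        "jupyter", "nbconvert",
        "--to", output_format,         # e.g. html, pdf, notebook
        "--execute",                   # run all cells before exporting
        f"--output-dir={output_dir}",  # where the exported file is written
        notebook,
    ]


cmd = nbconvert_command("train.ipynb")
print(shlex.join(cmd))
# A scheduler could then run it, for example with subprocess.run(cmd, check=True).
```

Building the command as a list rather than a string keeps it safe to hand to a job scheduler or `subprocess` without shell quoting issues.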
  33. 33. Deep Learning Workspace • User selects where to run the experiment • Job is packaged and submitted on behalf of the user • User has access to the Job Console to monitor the experiment
  34. 34. Summary
  35. 35. Interactive Workloads across Kubernetes Cluster • Jupyter Enterprise Gateway – Enable support for remote kernels in order to scale Notebooks across the entire cluster – Multitenant with support for user impersonation leveraging Kerberos – Base container image becomes a choice (e.g. Python with TensorFlow) • Deep Learning Workspace – Seamlessly integrate interactive development with Deep Learning frameworks for Model training – Schedule Notebooks to run remotely • [Diagram labels: Supported Platforms (Kernels, Runtimes), including FfDL]
  36. 36. Other Resources • Jupyter Enterprise Gateway at IBM Code: https://developer.ibm.com/code/openprojects/jupyter-enterprise-gateway/ • Jupyter Enterprise Gateway source code at GitHub: https://github.com/jupyter/enterprise_gateway • Jupyter Enterprise Gateway Documentation: http://jupyter-enterprise-gateway.readthedocs.io/en/latest/ • Jupyter Blog: https://blog.jupyter.org/
  37. 37. Thank you! @lresende1975
