Modern big data and machine learning in the era of cloud, docker and kubernetes

There is a major shift in web and mobile application architecture from the ‘old-school’ one to a modern ‘micro-services’ architecture based on containers. Kubernetes has been quite successful in managing those containers and running them in distributed computing environments.
Now enabling Big Data and Machine Learning on Kubernetes will allow IT organizations to standardize on the same Kubernetes infrastructure. This will propel adoption and reduce costs.
Kubeflow is an open source framework dedicated to making it easy to use the machine learning tool of your choice and deploy your ML applications at scale on Kubernetes. Kubeflow is becoming an industry standard as well!
Both Kubernetes and Kubeflow will enable IT organizations to focus more effort on applications rather than infrastructure.

Published in: Data & Analytics

  1. 1. Modern Big Data & Machine Learning in the era of cloud, Docker and Kubernetes
     Slim Baltagi
     Minneapolis, Minnesota, June 5th 2018
  2. 2. Agenda
     1. Key takeaways
     2. What is Docker?
     3. What is Kubernetes?
     4. Why Big Data on Kubernetes?
     5. Why Machine Learning on Kubernetes?
     6. How to get started?
  3. 3. 1. Key takeaways
     • There is a major shift in web and mobile application architecture from the ‘old-school’ one to a modern ‘micro-services’ architecture based on containers. Kubernetes has been quite successful in managing those containers and running them in distributed computing environments.
     • Now enabling Big Data and Machine Learning on Kubernetes will allow IT organizations to standardize on the same Kubernetes infrastructure. This will propel adoption and reduce costs.
     • Kubeflow is an open source framework dedicated to making it easy to use the machine learning tool of your choice and deploy your ML applications at scale on Kubernetes. Kubeflow is becoming an industry standard as well!
     • Both Kubernetes and Kubeflow will enable IT organizations to focus more effort on applications rather than infrastructure.
  5. 5. 2. What is Docker?
     The Docker logo is a sort of whale/boat hybrid carrying shipping containers. The analogy comes from freight transport, where goods are shipped in containers.
  6. 6. 2. What is Docker?
     Docker is an open source technology, released in 2013, for developing and deploying applications in containers. Container images package an application’s code, libraries, configuration and software dependencies together.
  7. 7. 2. What is Docker?
     A container is a runnable instance of an image. Container images can be pulled from a registry (such as Docker Hub at hub.docker.com, Azure Container Registry, …) and deployed anywhere a container runtime is installed: your laptop, on-premises servers, or the cloud.
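To make the registry idea concrete, here is a minimal sketch of how an image reference is commonly split into registry, repository and tag. It is a simplification written for illustration, not Docker's own parser: it ignores digests (`@sha256:…`), and the function name `parse_image_ref` and the registry host `myregistry.azurecr.io` are hypothetical.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    registry: str
    repository: str
    tag: str


def parse_image_ref(ref: str) -> ImageRef:
    """Split an image reference into (registry, repository, tag).

    Simplified rules: a first path component containing '.' or ':'
    (or equal to 'localhost') is treated as the registry host;
    otherwise Docker Hub ('docker.io') is assumed. A missing tag
    defaults to 'latest'. Digests are not handled.
    """
    first, sep, rest = ref.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest
    else:
        registry, remainder = "docker.io", ref

    # The tag is whatever follows the last ':' of the remaining name.
    name, colon, tag = remainder.rpartition(":")
    if not colon or "/" in tag:
        name, tag = remainder, "latest"

    # Official Docker Hub images live under the 'library/' namespace.
    if registry == "docker.io" and "/" not in name:
        name = "library/" + name
    return ImageRef(registry, name, tag)


if __name__ == "__main__":
    for ref in ("nginx", "nginx:1.25", "myregistry.azurecr.io/team/app:1.2"):
        print(ref, "->", parse_image_ref(ref))
```

The same reference resolves identically on a laptop, an on-premises server or a cloud VM, which is what makes the image portable.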
  8. 8. 2. What is Docker?
     • Some of the advantages that Docker offers:
       • Identical environments: deploy and run the same way in development, testing and production, so an application that works in one environment works in another.
       • Isolated environments for your individual applications
       • Version control: instead of “patching”, new functionality is added to a micro-service by replacing existing containers with ones that incorporate the new functionality.
       • Portability: easily move workloads between different versions of Linux, for example
       • Developer productivity
       • Application agility: how quickly you can evolve an application
       • Operational efficiency: containerized applications are easier to deploy.
       • Scale out (not up): simply start more containers
  10. 10. 3. What is Kubernetes?
      • The Kubernetes logo is literally a boat’s steering wheel.
      • It should be an admiral’s hat because, as we will see, Kubernetes helps you manage a fleet of Docker ‘boats’, not just one!
  11. 11. 3. What is Kubernetes?
      • Kubernetes (numeronym K8s) is an open source platform for automating the deployment, scaling and management of containerized applications, both in the cloud and on premises.
      • It was initially released by Google in 2014 and is now managed by the Cloud Native Computing Foundation (CNCF).
      • Kubernetes has already been adopted by the largest public cloud vendors and technology providers.
      • Some of the companies providing managed Kubernetes services or distributions: Google Cloud Platform (GCP) – GKE; Microsoft Azure – AKS; Amazon AWS – EKS; Oracle – OKE; IBM Cloud Container Service; Red Hat – OpenShift; Pivotal – PKS; Alibaba Cloud Container Service for Kubernetes, …
      • Kubernetes is being embraced by even more software vendors and enterprises.
  12. 12. 3. What is Kubernetes?
      • A Kubernetes cluster consists of at least one master node, which manages the cluster, and multiple worker nodes, where containerized applications run in Pods.
      • A Pod is a logical grouping of one or more containers. Pods enable multiple containers to run on a host machine and share resources such as storage, networking, and container runtime information.
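As an illustration of a Pod grouping several containers, the sketch below builds a Pod manifest as a plain Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The helper `make_pod_manifest`, the Pod and container names, and the image tags are all hypothetical; the field names (`apiVersion`, `kind`, `spec.containers`, `volumes`, `volumeMounts`, `emptyDir`) are those of the core `v1` Pod API.

```python
import json


def make_pod_manifest(name, containers, shared_mount="/shared"):
    """Build a Pod whose containers all mount one shared emptyDir volume.

    `containers` is a list of (container_name, image) pairs.
    Containers in the same Pod also share the Pod's network namespace.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Scratch volume that lives as long as the Pod does.
            "volumes": [{"name": "shared-data", "emptyDir": {}}],
            "containers": [
                {
                    "name": cname,
                    "image": image,
                    "volumeMounts": [
                        {"name": "shared-data", "mountPath": shared_mount}
                    ],
                }
                for cname, image in containers
            ],
        },
    }


pod = make_pod_manifest(
    "web-with-sidecar",
    [("web", "nginx:1.25"), ("log-agent", "busybox:1.36")],
)

if __name__ == "__main__":
    print(json.dumps(pod, indent=2))
```

Saving the printed JSON to a file and applying it would schedule both containers together on one worker node.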
  13. 13. 3. What is Kubernetes?
      Some of the advantages that Kubernetes offers:
      • Kubernetes makes containers manageable
      • Portability between cloud and on-premises environments
      • Kubernetes’ cloud-agnostic design lets containerized applications run on any platform without any changes to the application code.
      • Kubernetes provides two types of auto-scaling:
        • pod auto-scaling, where more pods are automatically created in a cluster based on scaling rules, and
        • cluster auto-scaling, where more nodes are added to a cluster based on flexible rules.
      • Monitoring: rather than having to rely on ad hoc monitoring approaches, system monitoring is built into Kubernetes and supports a wide range of features: replicas, rolling updates, auto-scaling, etc.
      • Better cluster resource utilization
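The pod auto-scaling rule can be sketched with the formula documented for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The function below is a simplified illustration, not the controller's code: it assumes a single metric, and the `min_replicas`/`max_replicas` defaults are arbitrary. The real controller also handles missing metrics, unready pods and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified Horizontal Pod Autoscaler scaling decision.

    Scales the replica count in proportion to how far the observed
    metric is from its target, skipping changes within the tolerance
    band and clamping the result to [min_replicas, max_replicas].
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target: scale out.
    print(desired_replicas(4, 90, 60))
    # 4 pods averaging 30% CPU against a 60% target: scale in.
    print(desired_replicas(4, 30, 60))
```

Cluster auto-scaling works one level down: when new pods cannot be scheduled for lack of capacity, nodes are added.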
  15. 15. 4. Why Big Data on Kubernetes?
      • Big Data on Kubernetes is now a reality thanks to:
        • The Big Data Special Interest Group in the Kubernetes community and the many companies collaborating on the related effort.
        • Newer Kubernetes features such as StatefulSets, custom schedulers, custom resources, custom controllers, the Container Storage Interface, …
        • More persistent storage options for running stateful applications on Kubernetes, depending on data type, such as object storage, file systems, software-defined storage, …
        • More and more Big Data/Fast Data tools running on Kubernetes, such as Apache Spark, Apache Kafka, Apache Flink, Apache Cassandra, Apache ZooKeeper, …
  16. 16. 4. Why Big Data on Kubernetes?
      Example: Apache Spark on Kubernetes
      • Video: Submitting Spark jobs using Kubernetes scheduler on AKS. March 16, 2018 https://www.youtube.com/watch?v=T7pAZplLiCk
      • Article: Running Apache Spark jobs on AKS. March 15, 2018 https://docs.microsoft.com/en-us/azure/aks/spark-job
      • Blog: Apache Spark 2.3 with Native Kubernetes Support. March 15, 2018
      • Docs: Running Spark on Kubernetes https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html
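For a feel of what native Kubernetes support looks like in practice, the sketch below assembles a `spark-submit` command line in the style of the Spark 2.3 "Running on Kubernetes" documentation: a `k8s://` master URL, cluster deploy mode, and a container image passed via `spark.kubernetes.container.image`. The helper `spark_submit_args` is hypothetical, and the API server address, image name and jar path are placeholders to be replaced with real values.

```python
import shlex


def spark_submit_args(api_server, image, app_jar, main_class,
                      name="spark-pi", executors=2):
    """Return the argv list for submitting a Spark job to Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",   # Kubernetes API server URL
        "--deploy-mode", "cluster",          # the driver runs in a pod too
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,                             # local:// means "inside the image"
    ]


args = spark_submit_args(
    api_server="https://example-cluster:443",            # placeholder
    image="registry.example.com/spark:2.3.0",            # placeholder
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",  # placeholder
    main_class="org.apache.spark.examples.SparkPi",
)

if __name__ == "__main__":
    # Print a copy-pasteable shell command rather than running it.
    print(" ".join(shlex.quote(a) for a in args))
```

Kubernetes then schedules the driver and executor pods like any other workload on the cluster.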
  17. 17. 4. Why Big Data on Kubernetes?
      • There are many ways to run Big Data applications such as Spark. For example:
        • Standalone mode using dedicated resources
        • YARN cluster co-resident with Hadoop
        • Mesos cluster alongside other Mesos applications
      • So, why would you run Big Data applications on Kubernetes?
      • In addition to all the advantages that Kubernetes offers, the following ones are particularly relevant to Big Data applications:
        • A single container orchestrator for all your applications
        • Increased server utilization
        • Isolation between workloads
        • Reduction in operational overhead
        • Language-agnostic distributed computing clusters
  18. 18. 4. Why Big Data on Kubernetes?
      • A single container orchestrator for all your applications: for example, Kubernetes can manage a broad range of workloads, so there is no need to deal with YARN/HDFS for data processing and a separate container orchestrator for your other applications. This solves the problem of running Big Data applications in silos in their own clusters.
      • Increased server utilization: for example, share nodes between Spark and other applications by having a streaming application run to feed a streaming Spark pipeline, or an nginx pod serve web traffic, without the need to statically partition nodes.
  19. 19. 4. Why Big Data on Kubernetes?
      • Isolation between workloads: for example, Kubernetes allows you to safely co-schedule batch workloads like Spark on the same nodes as latency-sensitive servers.
      • Reduction in operational overhead: for example, static clusters require greater operational know-how to do common tasks with Kafka, such as applying broker configuration updates, upgrading to a new version, and adding or decommissioning brokers. By using Kafka on Kubernetes, you can reduce the overhead for a number of common operational tasks with standard cluster resource manager features.
      • Containers and Kubernetes make great language-agnostic distributed computing clusters.
  21. 21. 5. Why Machine Learning on Kubernetes?
      Machine Learning on Kubernetes is now a reality thanks to:
      • Developments in Kubernetes such as support for stateful applications, extension points, …
      • Hardware acceleration for Kubernetes from Nvidia (GPU), Google (TPU: Tensor Processing Unit), …
      • Machine Learning tools running on Kubernetes such as Kubeflow, Paddle, Seldon, RiseML, Anaconda, H2O, …
      • The emergence of Kubeflow, an open source framework dedicated to making it easy to use the machine learning tool of your choice and deploy your ML applications in distributed mode on Kubernetes. Kubeflow is becoming the industry standard as well!
      • Services such as the one from Microsoft to train and serve TensorFlow models at scale with Kubernetes and Kubeflow on Azure Kubernetes Service (AKS)
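Hardware acceleration surfaces in Kubernetes as an extended resource that a container requests. The sketch below builds a Pod manifest asking for GPUs through the `nvidia.com/gpu` resource name, which assumes the NVIDIA device plugin is installed on the cluster. The helper `gpu_training_pod`, the Pod name and the image are hypothetical.

```python
import json


def gpu_training_pod(name, image, gpus=1):
    """Build a run-to-completion Pod that requests `gpus` NVIDIA GPUs.

    GPUs are requested under resources.limits; the scheduler then
    places the Pod only on a node with enough free GPUs.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",  # a training run, not a long-lived server
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


gpu_pod = gpu_training_pod("train-demo", "registry.example.com/trainer:latest", gpus=2)

if __name__ == "__main__":
    print(json.dumps(gpu_pod, indent=2))
```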
  22. 22. 5. Why Machine Learning on Kubernetes?
      • You've created a machine learning model using a tool of your choice, such as TensorFlow, PyTorch, or scikit-learn… Now what?
      • How can you ensure that the model is deployed to production and can scale as needed on incoming data?
      • How can you seamlessly migrate a model from your local laptop / virtual machine to your cloud platform of choice?
      • Kubeflow includes:
        • the JupyterHub platform for creating and managing Jupyter notebook servers that are used by data science and research groups
        • a TensorFlow Custom Resource for managing compute resources and sizing them to a specific cluster size
        • a TensorFlow Serving container to house the machine learning application
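The TensorFlow custom resource can be pictured as a manifest in which the number of workers is a single field. The sketch below follows the layout of the later `kubeflow.org/v1` `TFJob` schema, which is an assumption here: the Kubeflow release current when this deck was written used earlier alpha API versions with a different shape. The helper `tfjob_manifest`, the job name and the image are hypothetical.

```python
import json


def tfjob_manifest(name, image, workers=2):
    """Build a TFJob-style manifest with `workers` training replicas.

    Assumes the kubeflow.org/v1 schema, where each replica type
    (here only Worker) has a replica count and a Pod template whose
    training container is named 'tensorflow'.
    """
    return {
        "apiVersion": "kubeflow.org/v1",  # assumption: later stable API version
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,  # the single setting that sizes the job
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                {"name": "tensorflow", "image": image}
                            ]
                        }
                    },
                }
            }
        },
    }


job = tfjob_manifest("mnist-train", "registry.example.com/mnist:latest", workers=4)

if __name__ == "__main__":
    print(json.dumps(job, indent=2))
```

Changing `workers` is all it takes to resize the training job; the operator behind the custom resource creates or removes the corresponding pods.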
  23. 23. 5. Why Machine Learning on Kubernetes?
      • Distributed training instead of sequential: a huge time saver for large training jobs
      • Enabling Machine Learning at large scale
      • Mix of GPU and CPU nodes to serve as both a training and a serving platform
      • IT can better support data science and machine learning applications with Kubernetes as the common orchestration layer for all (containerized) applications
      • Ability for IT to create self-service environments for data scientists and other data users.
      • Single scheduling solution for multiple environments, on premises or in multiple clouds
      • Better resource utilization through centralized scheduling of data science and other containerized applications
  25. 25. 6. How to get started?
      • Learn from some free tutorials in your browser!
        • Docker & Containers https://www.katacoda.com/courses/docker
        • Kubernetes https://www.katacoda.com/courses/kubernetes
        • Kubeflow https://www.katacoda.com/kubeflow
      • Watch some demos
        • Sentiment Analysis using Kubernetes and Kubeflow, Google, May 31st 2018 https://www.youtube.com/watch?v=-ZlIuQXyD1A
        • OSS Unboxing – Kubeflow, Lachlan Evenson, Microsoft, May 11th 2018 https://www.youtube.com/watch?v=uL_pqP_HgcY
      • Do some labs
        • Labs for Training and Serving TensorFlow Models with Kubernetes and Kubeflow on Azure Container Service (AKS) https://github.com/Azure/kubeflow-labs
        • Introduction to Kubeflow on Google Kubernetes Engine (GKE) https://codelabs.developers.google.com/codelabs/kubeflow-introduction/index.html?index=..%2F..%2Fio2018#0
  26. 26. Thank you! Let’s keep in touch!
      @SlimBaltagi
      https://www.linkedin.com/in/slimbaltagi
      sbaltagi@gmail.com
