
Spark on Kubernetes



Containerization of Spark

Published in: Data & Analytics


  1. 1. Spark on Kubernetes Containerization of Spark https://github.com/phatak-dev/kubernetes-spark
  2. 2. ● Madhukara Phatak ● Director of Engineering, Tellius ● Works on Hadoop, Spark, ML and Scala ● www.madhukaraphatak.com
  3. 3. Agenda 1. Introduction to Containers 2. Spark and Containers 3. Introduction to Kubernetes 4. Kubernetes Abstractions 5. Static Spark Cluster on Kubernetes 6. Shortcomings of Spark Cluster on Kubernetes 7. Kubernetes as YARN 8. Spark Native Integration on Kubernetes 9. Future Work
  4. 4. Introduction to Containers
  5. 5. Microservices ● A way of developing and deploying an application as a collection of multiple services that communicate with each other through lightweight mechanisms, often an HTTP resource API ● These services are built around business capabilities and are independently deployable by fully automated deployment machinery ● These services can be written in different languages and can have different deployment strategies
  6. 6. Containers ● Containerisation is OS-level virtualization ● In the VM world, each VM has its own copy of the operating system ● Containers on a given machine share a common kernel ● Very lightweight ● Support resource isolation ● Most of the time, each microservice is deployed as an independent container ● This gives the ability to scale each service independently
  7. 7. Introduction to Docker ● Containers have been available in some operating systems, such as Solaris, for over a decade ● Docker popularised containers on Linux ● Docker is a container runtime for running containers on multiple operating systems ● Started in 2013 and now synonymous with containers ● rkt (Rocket) from CoreOS and LXD from Canonical are alternatives
  8. 8. Challenges with Containers ● Containers let the individual services of an application scale independently, but they make discovering and consuming these services challenging ● Monitoring these services across multiple hosts is also challenging ● Clustering multiple containers for big data workloads is hard with the default Docker tools ● So there needs to be a way to orchestrate these containers when you run a lot of services on top of them
  9. 9. Container Orchestrators ● Container orchestrators are tools for orchestrating containers at scale ● They mainly provide ○ Declarative configurations ○ Rules and Constraints ○ Provisioning on multiple hosts ○ Service Discovery ○ Health Monitoring ● Support multiple container runtimes
  10. 10. Different Container Orchestrators ● Docker Compose - not an orchestrator, but has basic service discovery ● Docker Swarm by Docker, Inc. ● Kubernetes by Google ● Apache Mesos with Docker integrations
  11. 11. Spark and Containers
  12. 12. Need for Spark on Containers ● Most Spark clusters today run on their own dedicated hardware and VMs ● Cloud providers like AWS offer their own managed services, such as EMR ● But more and more non-Spark workloads are being deployed in container environments ● Managing multiple different environments to run Spark and non-Spark workloads is tedious for operations and management
  13. 13. Challenges with Separate Spark Env ● The infrastructure cannot be fully utilised when Spark is not using all the hardware dedicated to it ● Integrating with non-Spark services is tedious, as different network infrastructure needs to be deployed ● No automatic scalability in on-prem deployments ● Resource sharing and restrictions cannot be applied uniformly across multiple applications ● Setting up clustering is challenging across different deployments such as clouds and on-prem
  14. 14. Spark on Containers ● More and more organisations want to unify their data pipelines on a single container infrastructure ● So they want Spark to be a good citizen of the container world, where Kubernetes is becoming the de facto standard ● When Spark runs on the same infrastructure as other systems, it becomes much easier to share and consume resources ● These are the motivations to deploy Spark on Kubernetes
  15. 15. Introduction to Kubernetes
  16. 16. Kubernetes ● Open source system for ○ Automating deployment ○ Scaling ○ Management of containerized applications ● Production-grade container orchestrator ● Based on Borg and Omega, the internal container orchestrators used by Google for 15 years ● https://kubernetes.io/
  17. 17. Why Kubernetes ● Production-grade container orchestration ● Support for cloud and on-prem deployments ● Agnostic to container runtime ● Support for easy clustering and load balancing ● Support for service upgrades and rollbacks ● Effective resource isolation and management ● Well-defined storage management
  18. 18. Minikube ● Minikube is a tool used to run Kubernetes locally ● It runs a single-node Kubernetes cluster using virtualization layers like VirtualBox, Hyper-V etc. ● In our example, we run Minikube using VirtualBox ● Very useful for trying out Kubernetes for development and testing purposes ● For installation steps, refer to http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-2/
  19. 19. Kubectl ● Kubectl is a command-line utility for interacting with the Kubernetes REST API ● It allows us to create, manage and delete different resources in Kubernetes ● Kubectl can connect to any Kubernetes cluster, irrespective of where it is running ● We need to install kubectl alongside Minikube to interact with Kubernetes
  20. 20. Minikube Operations ● Start Minikube: minikube start ● Observe the running VM in VirtualBox ● See the Kubernetes dashboard: minikube dashboard ● Run kubectl: kubectl get po
  21. 21. Kubernetes Abstractions
  22. 22. Different Types of Abstraction ● Compute abstractions (CPU): abstractions for creating and managing compute entities, e.g. Pod, Deployment ● Service/network abstractions (Network): abstractions for exposing services on the network ● Storage abstractions (Disk): disk-related abstractions
  23. 23. Compute Abstractions
  24. 24. Pod Abstraction ● A pod is a collection of one or more containers ● Smallest compute unit you can deploy on Kubernetes ● Host abstraction for Kubernetes ● All containers of a pod run on a single node ● Lets the containers in a pod communicate with each other using localhost
  25. 25. Defining Pod ● Kubernetes uses YAML/JSON for defining resources in its framework ● YAML is a human-readable serialization format mainly used for configuration ● All our examples use YAML ● We are going to define a pod in which we create an nginx container ● kube_examples/nginxpod.yaml
  26. 26. Creating and Running Pod ● Once we define the pod, we need to create and run it: kubectl create -f kube_examples/nginxpod.yaml ● See the running pod: kubectl get po ● Observe the same on the dashboard ● Stop the pod: kubectl delete -f kube_examples/nginxpod.yaml
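Slide 25 points to kube_examples/nginxpod.yaml without showing its contents. As a rough sketch of what such a pod definition might contain, the snippet below builds an equivalent manifest in Python and prints it as JSON, which `kubectl create -f` also accepts. The pod name, label, image and port are illustrative assumptions and are not taken from the talk's repository.

```python
import json


def nginx_pod_manifest(name="nginx-pod", image="nginx", port=80):
    """Build a minimal single-container Pod manifest.

    The name, label, image and port defaults are assumptions for illustration.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": "nginx"}},
        "spec": {
            "containers": [
                {
                    "name": "nginx",
                    "image": image,
                    "ports": [{"containerPort": port}],
                }
            ]
        },
    }


if __name__ == "__main__":
    # Redirect this output to a .json file and pass it to `kubectl create -f`.
    print(json.dumps(nginx_pod_manifest(), indent=2))
```

Writing the output to a file and creating it with kubectl should start a single nginx container inside one pod, matching the flow on slide 26.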
  27. 27. Spark Static Cluster on Kubernetes
  28. 28. Spark Cluster on Kubernetes ● A single pod is created for the Spark master ● There is one pod for each worker ● All the pods run a custom-built Spark image ● These pods are connected using Kubernetes networking abstractions ● This creates a static Spark cluster on Kubernetes ● A whole talk on this was given earlier [1]
  29. 29. Resource Definition ● As Spark is not aware that it is running on Kubernetes, it doesn't recognise the limits put on Kubernetes pods ● For example, in Kubernetes we can define a pod with 1 GB of RAM, but we may end up configuring the Spark worker with 10 GB of memory ● This mismatch in resource definitions makes it tedious to keep both in sync ● The same applies to CPU and disk bounds
  30. 30. Static Nature ● As the Spark cluster is created statically, it cannot scale automatically as it can on YARN or other cluster managers ● So Spark keeps consuming Kubernetes resources even when nothing is running ● This makes Spark a poor neighbour to have in the cluster ● The static nature also means it cannot request more resources when needed; manual intervention is required
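The resource mismatch from slide 29 (a pod limited to 1 GB while the worker is configured for 10 GB) can be caught with a small consistency check. The sketch below is an assumption-laden illustration: the helper names are hypothetical, only a subset of Kubernetes quantity suffixes (Ki, Mi, Gi, M, G) and Spark suffixes (k, m, g, t) is handled, and both values are required to carry a unit.

```python
# Subset of Kubernetes memory quantity suffixes (binary and decimal).
K8S_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "M": 10**6, "G": 10**9}
# Spark memory strings such as "10g" use binary multiples.
SPARK_UNITS = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}


def k8s_bytes(quantity):
    """Convert a Kubernetes memory quantity such as '1Gi' to bytes."""
    # Try two-letter suffixes first so that 'Gi' is not mistaken for 'G'.
    for suffix in sorted(K8S_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * K8S_UNITS[suffix]
    raise ValueError(f"unsupported Kubernetes quantity: {quantity}")


def spark_bytes(memory):
    """Convert a Spark memory string such as '10g' to bytes."""
    unit = memory[-1].lower()
    if unit not in SPARK_UNITS:
        raise ValueError(f"unsupported Spark memory string: {memory}")
    return int(memory[:-1]) * SPARK_UNITS[unit]


def worker_fits_pod(pod_limit, worker_memory):
    """Return True when the Spark worker memory fits within the pod limit."""
    return spark_bytes(worker_memory) <= k8s_bytes(pod_limit)


if __name__ == "__main__":
    # The mismatch described on the slide: 1 GB pod, 10 GB worker.
    print(worker_fits_pod("1Gi", "10g"))
```

A check like this only detects the mismatch; in a static cluster the two settings still have to be kept in sync by hand.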
  31. 31. Kubernetes as YARN
  32. 32. Kubernetes as YARN ● YARN was one of the first general-purpose container management systems created for big data ● In YARN, even though containers run as Java processes, they can run any application using JNI ● This makes YARN a generic container management tool that can run any application ● It is very rarely used outside big data, even though it has generic container underpinnings
  33. 33. Spark on YARN ● When Spark is deployed on YARN, it treats YARN as a container management system ● Spark requests containers from YARN with defined resources ● Once it acquires the containers, it sets up RPC-based communication between them to run the driver and executors ● Spark can scale automatically by releasing and acquiring containers
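The automatic scaling mentioned on slide 33 is usually switched on through Spark's dynamic allocation settings. The sketch below assembles a spark-submit command line for YARN with those settings. The helper name, the executor bounds and the jar path are illustrative assumptions; the configuration keys are standard Spark properties.

```python
import shlex


def yarn_dynamic_allocation_args(app_jar, main_class, min_executors=1, max_executors=10):
    """Build a spark-submit argument list that enables dynamic allocation on YARN."""
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service keeps shuffle files available
        # after an executor has been released.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    args = ["bin/spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
            "--class", main_class]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_jar)
    return args


if __name__ == "__main__":
    # "examples.jar" is a hypothetical placeholder path.
    cmd = yarn_dynamic_allocation_args("examples.jar", "org.apache.spark.examples.SparkPi")
    print(" ".join(shlex.quote(part) for part in cmd))
```

With these settings Spark asks YARN for more containers when tasks queue up and returns idle ones, which is the behaviour a static cluster on Kubernetes lacks.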
  34. 34. Spark Native Integration with K8s
  35. 35. Spark and Kubernetes ● From Spark 2.3, Spark supports Kubernetes as a new cluster backend ● It adds to the existing list of YARN, Mesos and standalone backends ● This is a native integration, where no static cluster needs to be built beforehand ● Works very similarly to how Spark works on YARN ● The next section shows the different capabilities
  36. 36. Running Spark on Kubernetes
  37. 37. Building Image ● Every Kubernetes abstraction needs an image to run ● Spark 2.3 ships a script to build an image of the latest Spark with all the dependencies it needs ● So as the first step, we run the script to build the image ● Once the image is ready, we can run a simple Spark example to see that the integration is working ● ./bin/docker-image-tool.sh -t spark_2.3 build [2]
  38. 38. Run Pi Example On Kubernetes ● bin/spark-submit --master k8s://https://192.168.99.100:8443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=madhu/spark:spark_2.3 local:///opt/examples/jars/examples.jar
  39. 39. Accessing UI and Logs ● kubectl port-forward <driver-pod-name> 4040:4040 ● kubectl -n=<namespace> logs -f <driver-pod-name>
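The `--master` value on slide 38 combines a `k8s://` prefix with the URL of the Kubernetes API server. The sketch below takes such a value apart to make that structure explicit. The function name is hypothetical, and the fallback to https when no protocol is given reflects my reading of the Spark documentation rather than anything stated in the talk.

```python
from urllib.parse import urlparse


def parse_k8s_master(master):
    """Split a Spark master URL of the form k8s://<api_server_url>.

    Returns (scheme, host, port) of the Kubernetes API server.
    """
    prefix = "k8s://"
    if not master.startswith(prefix):
        raise ValueError(f"not a Kubernetes master URL: {master}")
    api_server = master[len(prefix):]
    # Assumption: a missing protocol is treated as https.
    if "://" not in api_server:
        api_server = "https://" + api_server
    parsed = urlparse(api_server)
    return parsed.scheme, parsed.hostname, parsed.port


if __name__ == "__main__":
    # The address used in the slide's minikube example.
    print(parse_k8s_master("k8s://https://192.168.99.100:8443"))
```

The host and port here are the API server address of the cluster, which for the minikube setup in the talk is the address of the local VM.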
  40. 40. Architecture
  41. 41. Kubernetes Custom Controller ● A Kubernetes custom controller is an extension to the Kubernetes API for defining and creating custom resources in Kubernetes ● Spark uses a custom controller to create the Spark driver, which in turn is responsible for creating worker pods ● This functionality was added in Kubernetes 1.6 ● This allows frameworks like Spark to integrate natively with Kubernetes
  42. 42. Architecture
  43. 43. References ● https://www.youtube.com/watch?v=Q0miRvKA4yk&t=13s ● https://spark.apache.org/docs/2.3.0/running-on-kubernetes.html#docker-images ● https://databricks.com/session/apache-spark-on-kubernetes ● https://martinfowler.com/articles/microservices.html ● https://thenewstack.io/containers-container-orchestration/
  44. 44. References ● http://blog.madhukaraphatak.com/categories/kubernetes-series/ ● https://kubernetes.io/docs/home/
