Data Engineer's Lunch #80: Apache Spark Resource Managers

In Data Engineer's Lunch #80, Obioma Anomnachi will compare and contrast the different resource managers available for Apache Spark. We will cover local, standalone, YARN, and Kubernetes resource managers and discuss how each one allows the user different levels of control over how resources given to spark are distributed to Spark applications.

  1. Apache Spark Resource Managers
     Comparison of Local, Standalone, YARN, and Kubernetes resource managers for Apache Spark.
     Obioma Anomnachi, Engineer @ Anant
     Version 1.0
  2. Apache Spark
     ● Apache Spark:
       ○ Open-source unified analytics engine for large-scale data processing
       ○ Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
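
As a point of reference, a minimal PySpark program looks like the sketch below. The input path is hypothetical, and the same code runs unchanged under any of the resource managers covered in the following slides.

    from pyspark.sql import SparkSession

    # A SparkSession is the entry point to the unified engine; the work it
    # schedules is parallelized across whatever resources the chosen resource
    # manager hands to the application.
    spark = SparkSession.builder.appName("spark-intro-sketch").getOrCreate()

    df = spark.read.text("README.md")   # hypothetical input file
    print(df.count())                   # a distributed action

    spark.stop()
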
  3. Spark Resource Managers
     ● Resource managers (a.k.a. cluster managers) determine the distribution of resources and work on a given Spark cluster.
       ○ Separate from the resources given to master/worker nodes, these systems determine how resources are split up within the nodes and how applications are granted access to them
       ○ This is also a different setting from --deploy-mode, which deals mostly with the location of the driver (client mode puts it on the machine running the spark-submit while cluster mode puts it on the cluster)
     ● Resource Manager Options
       ○ Local (the validity of this as one of the resource managers is up for debate)
       ○ Spark Standalone
       ○ YARN
       ○ Mesos (deprecated as of Spark 3.2.0 - up for removal)
       ○ Kubernetes
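
In practice, the resource manager is selected entirely by the master URL (the --master flag or the spark.master property), independent of --deploy-mode. A rough sketch of the URL forms, with placeholder hostnames and ports:

    from pyspark.sql import SparkSession

    # Example master URL forms (hostnames and ports are placeholders):
    #   local[*]                          - local mode, no cluster manager
    #   spark://spark-master:7077         - Spark Standalone
    #   yarn                              - YARN (cluster found via HADOOP_CONF_DIR/YARN_CONF_DIR)
    #   mesos://mesos-master:5050         - Mesos (deprecated as of Spark 3.2.0)
    #   k8s://https://k8s-apiserver:6443  - Kubernetes API server
    spark = (SparkSession.builder
             .appName("resource-manager-selection")
             .master("local[*]")   # swap in any of the URLs above
             .getOrCreate())
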
  4. Local
     ● Technically this is not a resource manager; local mode forgoes the need for one by running everything within a single JVM process.
     ● Activated by providing local, local[n], or local[*] as the value for the --master flag when running Spark binaries (spark-shell, spark-submit, pyspark)
       ○ local defaults to a single thread, local[n] creates n threads, and local[*] creates as many threads as there are CPU cores available to the JVM
     ● If you don't provide a --master flag value, Spark defaults to local[*]
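
A minimal sketch of local mode in PySpark; the comments restate the thread-count rules above:

    from pyspark.sql import SparkSession

    # local     -> one worker thread
    # local[4]  -> four worker threads
    # local[*]  -> one thread per CPU core visible to the JVM (also the
    #              default when no master is specified)
    spark = (SparkSession.builder
             .appName("local-mode-sketch")
             .master("local[4]")
             .getOrCreate())

    # Everything runs inside this single JVM process; no cluster manager is involved.
    print(spark.sparkContext.defaultParallelism)   # 4 for local[4]
    spark.stop()
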
  5. Spark Standalone
     ● Bundled with open source Spark
       ○ Requires a compiled version of Spark on each node
     ● Simple architecture with no extra functionality
     ● Consists of a master process and potentially many worker processes
       ○ The master accepts applications and schedules worker resources
       ○ Worker processes launch executors that perform task execution
     ● Launch a cluster using the start-master.sh and start-worker.sh scripts on relevant machines, or set up conf/workers and use the provided launch scripts
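
Once the master and workers are running, an application attaches to the cluster through the master's URL. A sketch with a placeholder hostname, assuming the default ports:

    from pyspark.sql import SparkSession

    # The standalone master accepts applications on port 7077 by default and
    # serves its web UI on port 8080. The hostname is a placeholder.
    spark = (SparkSession.builder
             .appName("standalone-sketch")
             .master("spark://spark-master.example.com:7077")
             .getOrCreate())
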
  6. Spark Standalone Resource Allocation
     ● The user configures the pool of resources available to the worker processes
     ● Only allows FIFO scheduling between applications (if two request the same resources, whichever asked first gets them)
     ● By default, an application uses all the cores available on the cluster, which limits the number of jobs that can run at once to one. To avoid this, set spark.cores.max in the SparkConf
     ● The number of cores assigned to each executor is also configurable. If spark.executor.cores is set, several executors for an application can run on one worker. Otherwise each application is limited to one executor per worker, which uses all the cores on that worker.
     ● A single master node means a single point of failure; this can be mitigated via ZooKeeper or local-file-system-based per-node recovery
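
The two settings mentioned above can be sketched roughly as follows; the hostname and the values are illustrative:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("standalone-allocation-sketch")
             .master("spark://spark-master.example.com:7077")   # placeholder host
             # Cap the total cores this application takes from the cluster so
             # later FIFO-scheduled applications are not starved.
             .config("spark.cores.max", "8")
             # Allow several smaller executors per worker instead of a single
             # executor that claims every core on the worker.
             .config("spark.executor.cores", "2")
             .config("spark.executor.memory", "2g")
             .getOrCreate())
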
  7. YARN
     ● Consists of a single Resource Manager and a Node Manager for each node
     ● Applications run inside containers, with an application master in a container by itself
       ○ In YARN, the application master requests resources for applications from the Resource Manager
       ○ In Spark, the Spark driver acts as the application master
     ● Node Managers track resource usage and report back to the Resource Manager
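
Submitting to YARN only needs the master URL yarn; the cluster's location comes from the Hadoop configuration on the submitting machine. Whether the driver runs inside the application master container (cluster deploy mode) or on the submitting machine (client mode) is normally chosen with spark-submit's --deploy-mode flag. A minimal sketch, assuming HADOOP_CONF_DIR or YARN_CONF_DIR is set:

    from pyspark.sql import SparkSession

    # Assumes HADOOP_CONF_DIR or YARN_CONF_DIR points at the cluster's
    # configuration files; "yarn" is the entire master URL.
    spark = (SparkSession.builder
             .appName("yarn-sketch")
             .master("yarn")
             .getOrCreate())
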
  8. YARN Resource Allocation
     ● Three modes of resource scheduling
       ○ FIFO - same as standalone
       ○ Capacity - guarantees capacity availability for organizations
       ○ Fair - all applications get an equal share
     ● Defaults to two executors per node and one core per executor
     ● Needs memory overhead for internal JVM container processes - if an executor uses more memory than executor memory + memoryOverhead, YARN kills the container
     ● YARN's UI for applications is different from Spark Standalone's Spark Master UI
     ● Offers dynamic allocation - starts with a specified number of executors and can request more from the Resource Manager if tasks are waiting for a long time
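
A rough sketch of the memory overhead and dynamic allocation settings described above; the values are illustrative, and the shuffle-tracking setting is one common prerequisite for dynamic allocation in recent Spark versions:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("yarn-allocation-sketch")
             .master("yarn")
             # The container requested from YARN is roughly spark.executor.memory
             # plus spark.executor.memoryOverhead; exceeding that gets the
             # container killed.
             .config("spark.executor.memory", "4g")
             .config("spark.executor.memoryOverhead", "512m")
             # Dynamic allocation: start small and ask the Resource Manager for
             # more executors when tasks sit in the queue too long.
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.dynamicAllocation.initialExecutors", "2")
             .config("spark.dynamicAllocation.maxExecutors", "20")
             # Lets Spark release executors without losing their shuffle output.
             .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
             .getOrCreate())
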
  9. Mesos
     ● Mesos uses masters, workers, and frameworks (similar to applications in Spark Standalone)
     ● Masters schedule worker resources among frameworks that want them, and workers launch executors which execute tasks
     ● Mesos can schedule non-Spark applications, and resources like disk space and network ports as well as CPU and memory
     ● Mesos offers resources to frameworks rather than the frameworks demanding resources from the cluster
  10. Mesos Resource Allocation
     ● In coarse-grained mode, runs one Spark executor per Mesos worker
     ● In fine-grained mode, runs one Spark executor per Spark task (one type of Mesos task)
     ● Mesos tasks are executed in containers - either Linux cgroups or Docker containers
     ● Has its own UI showing frameworks
     ● Scheduling is done by the resource-allocation module on the Mesos master and the framework's internal scheduler (the fine-grained and coarse-grained Spark schedulers live here)
     ● Uses the Dominant Resource Fairness (DRF) algorithm, offering resources to the framework currently using the fewest resources according to DRF
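
For comparison only (Mesos support is deprecated as of Spark 3.2.0), the coarse- vs fine-grained choice was a Spark configuration setting; the hostname and values below are placeholders:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("mesos-sketch")
             .master("mesos://mesos-master.example.com:5050")   # placeholder host
             # Coarse-grained mode (the default): one long-lived Spark executor
             # per Mesos worker. Setting this to "false" selected the older,
             # long-deprecated fine-grained mode (one executor per task).
             .config("spark.mesos.coarse", "true")
             .config("spark.cores.max", "8")
             .getOrCreate())
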
  11. Kubernetes
     ● Uses spark-submit to submit Spark applications to a Kubernetes cluster
       ○ Creates a Spark driver in a Kubernetes pod
       ○ The driver creates executors in their own pods
       ○ Once the application is complete, the executor pods terminate and get cleaned up, while the driver pod remains in the Kubernetes API with its logs until it gets cleaned up
     ● Spark 2.3+ ships with a Dockerfile for building a Docker image for use with Kubernetes
     ● Resources are all managed by Kubernetes
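
Kubernetes submissions normally go through spark-submit, but the key settings can be sketched as configuration; the API server address, namespace, image, and service account below are all placeholders:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("kubernetes-sketch")
             # The master URL points at the Kubernetes API server.
             .master("k8s://https://k8s-apiserver.example.com:6443")
             .config("spark.kubernetes.namespace", "spark-jobs")
             # Image built from the Dockerfile shipped with the Spark distribution.
             .config("spark.kubernetes.container.image", "registry.example.com/spark-py:3.3.0")
             .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
             # Number of executor pods requested from Kubernetes.
             .config("spark.executor.instances", "4")
             .getOrCreate())
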
  12. Resources
     ● Spark in Action
     ● Spark Standalone Docs
     ● Spark YARN Docs
     ● Spark Mesos Docs
  13. Strategy: Scalable Fast Data
      Architecture: Cassandra, Spark, Kafka
      Engineering: Node, Python, JVM, CLR
      Operations: Cloud, Container
      Rescue: Downtime!! I need help.
      www.anant.us | solutions@anant.us | (855) 262-6826
      3 Washington Circle, NW | Suite 301 | Washington, DC 20037
