Spark in YARN-managed multi-tenant clusters

Pravin Mittal (pravinm@Microsoft.com)
Rajesh Iyer (riyer@Microsoft.com)

Spark on Azure HDInsight
Fully Managed Service
• 100% open source Apache Spark and Hadoop bits
• Latest releases of Spark
• Fully supported by Microsoft and Hortonworks
• 99.9% Azure Cloud SLA; 24/7 Managed Service
• Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC
Optimized for experimentation and development
• Jupyter Notebooks (Scala, Python, automatic data visualizations)
• IntelliJ plugin (job submission, remote debugging)
• ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc.

Make Spark Simple - Integrated with Azure Ecosystem
• Microsoft R Server - Multi-threaded math libraries and transparent parallelization mean R Server can handle up to 1000x more data at up to 50x the speed of open source R. It is based on open source R and does not require any change to R scripts
• Azure Data Lake Store - HDFS for the cloud, optimized for massive throughput, ultra-high capacity, low latency, and secure ACL support
• Azure Data Factory orchestrates Spark ETL pipelines
• Power BI connector for Spark for rich visualization. New in Power BI is a streaming connector that lets you publish real-time events from Spark Streaming directly to Power BI.
• Event Hubs connector as a data source for Spark Streaming
• Azure SQL Data Warehouse and HBase connectors for fast and scalable storage

Jupyter-Spark Integration via Livy
• Sparkmagic is an open source library that Microsoft is incubating under the Jupyter Incubator program
• Thousands of Spark clusters in production providing feedback to further improve the experience
https://github.com/jupyter-incubator/sparkmagic

Spark Execution Model
Each Spark application is an instance of SparkContext and gets its own executor processes, which live for the lifetime of the application.
Spark is agnostic to the cluster manager as long as it can acquire executor processes that can communicate with each other.
The driver program must listen for and accept incoming connections from its executors throughout its lifetime.
The driver is responsible for scheduling tasks on the cluster.

Why YARN as Cluster Manager?
Microsoft, Cloudera, Hortonworks, IBM and many others are all actively working to improve YARN
YARN allows you to dynamically share and centrally configure the same pool of cluster resources
between all frameworks that run on YARN.
YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against
Kerberized Hadoop clusters and uses secure authentication between its processes.
YARN allows us to have a richer resource management policy
• It lets us maximize cluster utilization, share resources fairly, preempt dynamically when running multiple concurrent applications, and provide different resource guarantees for batch and interactive workloads.
[1] http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

YARN Allocation Model for Spark
SparkSubmit starts and talks to the ResourceManager for the cluster.
The ResourceManager makes a single container request on behalf of SparkSubmit.
The ApplicationMaster starts running within that container.
The ApplicationMaster requests subsequent containers from the ResourceManager; these are allocated to the Spark executors, which run tasks for the application.
For Spark batch applications, all the Spark executor containers and the ApplicationMaster are freed when the application finishes.
For Spark interactive applications (dynamic executor allocation enabled), Spark executors are freed after an idle timeout, but the ApplicationMaster remains until the Spark driver exits.
https://blog.cloudera.com/blog/2015/09/untangling-apache-hadoop-yarn-part-1/
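
The idle-timeout behaviour for interactive applications can be illustrated with a toy model. This is a sketch, not Spark code: the `Executor` class and `release_idle` function are hypothetical names introduced here, and the 60-second value is assumed to mirror the default of `spark.dynamicAllocation.executorIdleTimeout`.

```python
from dataclasses import dataclass


@dataclass
class Executor:
    """Hypothetical stand-in for one YARN container running a Spark executor."""
    executor_id: str
    last_task_finished_at: float  # seconds since application start
    running_tasks: int = 0


def release_idle(executors, now, idle_timeout=60.0):
    """Split executors into (kept, released) using an idle-timeout rule.

    An executor is released only if it has no running tasks and has been
    idle for at least `idle_timeout` seconds. The ApplicationMaster is not
    modelled because it stays up until the driver exits.
    """
    kept, released = [], []
    for ex in executors:
        idle_for = now - ex.last_task_finished_at
        if ex.running_tasks == 0 and idle_for >= idle_timeout:
            released.append(ex)
        else:
            kept.append(ex)
    return kept, released


executors = [
    Executor("exec-1", last_task_finished_at=10.0),
    Executor("exec-2", last_task_finished_at=100.0),
    Executor("exec-3", last_task_finished_at=5.0, running_tasks=2),
]
kept, released = release_idle(executors, now=120.0)
print([e.executor_id for e in released])  # ['exec-1']
print([e.executor_id for e in kept])      # ['exec-2', 'exec-3']
```

Only the executor that has been idle longer than the timeout is handed back to YARN; a busy executor is never released this way, which is why preemption is needed to reclaim in-use resources.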

Running Spark on YARN in HDInsight
• Requirements
• Maximize cluster utilization, i.e. reduce idle resources
• Fair resource sharing between different Spark applications
• Resource guarantee

Maximize cluster utilization
• Reduce allocation of idle resources
• An application should be able to use the entire cluster if necessary
• Should be able to work with cluster scaling
• What is the ideal setting for the number of executors for any Spark application?
• Spark static allocation
▪ Set spark.executor.instances to a large value
• Spark dynamic allocation
▪ spark.dynamicAllocation.enabled = true
▪ Set spark.dynamicAllocation.maxExecutors to a large value
• YARN capacity scheduler queue
▪ Set yarn.scheduler.capacity.<parent queue>.<child queue>.maximum-capacity to 100
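
As a rough illustration of the Spark-side settings, the sketch below assembles a `spark-submit` command line for either allocation mode. The function name, the `etl_job.py` application, and the `batch` queue are hypothetical. The `spark.shuffle.service.enabled` entry is not in the slides; it is added on the assumption that dynamic allocation on YARN is normally paired with the external shuffle service.

```python
import shlex


def spark_submit_command(app, queue, max_executors, dynamic=True):
    """Build a spark-submit argument list for a YARN cluster (illustrative)."""
    args = ["spark-submit", "--master", "yarn", "--queue", queue]
    if dynamic:
        conf = {
            "spark.dynamicAllocation.enabled": "true",
            "spark.dynamicAllocation.maxExecutors": str(max_executors),
            # Assumption, not from the slides: the external shuffle service
            # is usually enabled alongside dynamic allocation.
            "spark.shuffle.service.enabled": "true",
        }
    else:
        # Static allocation: request a fixed, large number of executors.
        conf = {"spark.executor.instances": str(max_executors)}
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


cmd = spark_submit_command("etl_job.py", queue="batch", max_executors=1000)
print(shlex.join(cmd))
```

With a large upper bound and the queue's maximum-capacity at 100, a single application can grow to fill the cluster when nothing else is running.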

Fair resource sharing
• Concurrent applications should be able to share resources
• Use separate YARN capacity scheduler queues for different Spark contexts
▪ Queues are statically created
▪ Allocated resources are not shared between different Spark contexts
▪ Need a way to reclaim allocated resources when another Spark context comes along
▪ YARN preemption AND Spark dynamic allocation
▪ Spark dynamic allocation gives up only idle resources
▪ YARN preemption reclaims in-use resources (yarn.resourcemanager.scheduler.monitor.enable and yarn.resourcemanager.scheduler.monitor.policies)
▪ YARN preemption is predictable with yarn.scheduler.capacity.resource-calculator = DefaultResourceCalculator
▪ YARN JIRA YARN-4390
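
The arithmetic behind reclaiming in-use resources can be sketched with a deliberately simplified model. This is not the algorithm used by YARN's preemption policy, which has additional controls; `preemption_plan` and the queue names are hypothetical, and the model assumes the guaranteed capacities add up to the whole cluster.

```python
def preemption_plan(guaranteed, used, pending):
    """Toy model: how many units each queue gives back.

    guaranteed, used: dicts of queue name -> resource units.
    pending: dict of queue name -> units requested but not yet allocated.
    Queues above their guarantee give back in proportion to their overage.
    """
    # Units that under-served queues are entitled to and are asking for.
    needed = 0
    for queue in guaranteed:
        shortfall = max(guaranteed[queue] - used[queue], 0)
        needed += min(shortfall, pending.get(queue, 0))

    # Free capacity is handed out first; only the remainder is preempted.
    free = sum(guaranteed.values()) - sum(used.values())
    to_reclaim = max(needed - free, 0)

    overage = {q: max(used[q] - guaranteed[q], 0) for q in guaranteed}
    total_over = sum(overage.values())
    if to_reclaim == 0 or total_over == 0:
        return {q: 0.0 for q in guaranteed}
    return {q: to_reclaim * overage[q] / total_over for q in guaranteed}


plan = preemption_plan(
    guaranteed={"notebooks": 50, "batch": 50},
    used={"notebooks": 0, "batch": 100},
    pending={"notebooks": 30},
)
print(plan)  # {'notebooks': 0.0, 'batch': 30.0}
```

Here the batch queue has grown to the full cluster, so 30 units must be taken back from running containers when a notebook context arrives; dynamic allocation alone would return nothing because none of those executors are idle.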

Fair resource sharing
• Use separate Spark resource pools within the same Spark context
▪ Resource pools are dynamically created per context
▪ Allocated resources are shared between different Spark jobs
▪ No need to reclaim allocated resources when another Spark job comes along
• Combine both approaches (separate YARN queues and Spark resource pools) to support concurrently running notebook, batch, and BI workloads in the same cluster
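
Spark resource pools belong to Spark's fair scheduler (`spark.scheduler.mode = FAIR`), and a job picks its pool through the `spark.scheduler.pool` local property. Pools can optionally be described in an allocation file referenced by `spark.scheduler.allocation.file`. The sketch below generates such a file; the pool names, weights, and minimum shares are hypothetical values chosen for illustration.

```python
import xml.etree.ElementTree as ET


def fair_scheduler_xml(pools):
    """Render a Spark fair scheduler allocation file.

    pools: dict of pool name -> (weight, min_share).
    """
    root = ET.Element("allocations")
    for name, (weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: notebook jobs get twice the weight of BI queries.
xml_text = fair_scheduler_xml({"notebook": (2, 1), "bi": (1, 0)})
print(xml_text)
```

Because all pools live inside one Spark context, they share the executors that context already holds, so nothing has to be reclaimed from YARN when a new job starts.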

Resource guarantee
• Every Spark application should be able to run immediately
• Combination
▪ Separate YARN capacity queues, with yarn.scheduler.capacity.<parent queue>.<child queue>.capacity used to guarantee resources for different Spark applications
▪ Separate Spark resource pools within the same Spark application
▪ YARN preemption to ensure that in-use resources can be reclaimed
▪ Spark dynamic allocation to ensure that idle resources can be reclaimed
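
When splitting a parent queue into guaranteed shares, a quick sanity check can catch mistakes before the scheduler is refreshed. The sketch below assumes percentage-based capacities, where the children of one parent are expected to sum to 100 and a queue's maximum-capacity should not be lower than its capacity. The function and queue names are hypothetical.

```python
def validate_queue_capacities(queues):
    """Check sibling queues under one parent.

    queues: dict of queue name -> (capacity, maximum_capacity), in percent.
    Returns a list of human-readable problems (empty if none were found).
    """
    errors = []
    total = sum(capacity for capacity, _ in queues.values())
    if abs(total - 100) > 1e-6:
        errors.append(f"capacities sum to {total}, expected 100")
    for name, (capacity, maximum) in queues.items():
        if maximum < capacity:
            errors.append(
                f"{name}: maximum-capacity {maximum} is below capacity {capacity}"
            )
    return errors


good = {"default": (50, 100), "notebooks": (30, 100), "batch": (20, 100)}
bad = {"notebooks": (60, 100), "batch": (50, 40)}
print(validate_queue_capacities(good))  # []
for problem in validate_queue_capacities(bad):
    print(problem)
```

Setting capacity gives each queue its guarantee, while a maximum-capacity of 100 still lets any single queue expand to the whole cluster when the others are idle.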

Working configuration
• Spark settings
▪ spark.executor.instances = <very large value>
▪ OR
▪ spark.dynamicAllocation.enabled = true
▪ spark.dynamicAllocation.initialExecutors = 0
▪ spark.dynamicAllocation.minExecutors = 0
▪ spark.dynamicAllocation.maxExecutors = <very large value>
• YARN settings
▪ yarn.resourcemanager.scheduler.monitor.enable = true
▪ yarn.resourcemanager.scheduler.monitor.policies = org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
▪ yarn.scheduler.capacity.resource-calculator = org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
▪ yarn.scheduler.capacity.root.queues = default,<n queues>
▪ yarn.scheduler.capacity.<parent_queue>.<child_queue>.capacity
▪ yarn.scheduler.capacity.<parent_queue>.<child_queue>.maximum-capacity
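
The YARN settings above end up as Hadoop-style `<property>` entries. The sketch below renders them as XML, assuming the preemption monitor settings go in yarn-site.xml and the queue settings in capacity-scheduler.xml. The queue names `notebooks` and `batch` and the 50/25/25 split are hypothetical, and a managed cluster may expose these values through its management UI rather than raw files.

```python
import xml.etree.ElementTree as ET


def hadoop_configuration_xml(properties):
    """Render a dict of settings as a Hadoop <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


PREEMPTION_POLICY = (
    "org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity."
    "ProportionalCapacityPreemptionPolicy"
)

yarn_site = {
    "yarn.resourcemanager.scheduler.monitor.enable": "true",
    "yarn.resourcemanager.scheduler.monitor.policies": PREEMPTION_POLICY,
}

capacity_scheduler = {
    "yarn.scheduler.capacity.resource-calculator":
        "org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator",
    # Hypothetical queue layout: three queues directly under root.
    "yarn.scheduler.capacity.root.queues": "default,notebooks,batch",
}
for queue, capacity in {"default": 50, "notebooks": 25, "batch": 25}.items():
    prefix = f"yarn.scheduler.capacity.root.{queue}"
    capacity_scheduler[f"{prefix}.capacity"] = capacity
    capacity_scheduler[f"{prefix}.maximum-capacity"] = 100

print(hadoop_configuration_xml(yarn_site))
print(hadoop_configuration_xml(capacity_scheduler))
```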
