SlideShare a Scribd company logo
Spark in YARN-managed
multi-tenant clusters
Pravin Mittal (pravinm@Microsoft.com)
Rajesh Iyer (riyer@Microsoft.com)
Spark on Azure HDInsight
Fully Managed Service
• 100% open source Apache Spark and Hadoop bits
• Latest releases of Spark
• Fully supported by Microsoft and Hortonworks
• 99.9% Azure Cloud SLA; 24/7 Managed Service
• Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC
Optimized for experimentation and development
• Jupyter Notebooks (scala, python, automatic data visualizations)
• IntelliJ plugin (job submission, remote debugging)
• ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc
Make Spark Simple - Integrated with Azure
Ecosystem
• Microsoft R Server - Multi-threaded math libraries and transparent parallelization in R Server means handling up to 1000x more data and up to
50x faster speeds than open source R. This is based on open source R, it does not require any change to R scripts
• Azure Data Lake Store – HDFS for the cloud, optimized for massive throughput, Ultra-high capacity, Low Latency, Secure ACL support
• Azure Data Factory orchestrates Spark ETL pipeline
• PowerBI connector for Spark for rich visualization.Newin Power BI is a streaming connector allowing you to publish real-time events from Spark
Streaming directly to Power BI.
• EventsHub connector as a data source for Spark streaming
• Azure SQL Datawarehouse & Hbase connector for fast & scalable storage
Jupyter-Spark Integration via Livy
• Sparkmagic is an open source library that Microsoft is incubating under the Jupyter Incubator program
• Thousands of Spark clusters in production providing feedback to further improve the experience
https://github.com/jupyter-incubator/sparkmagic
Spark Execution Model
Each Spark Application is an instance of SparkContext that
gets its own executor processes that has application
lifetime
Spark is agnostic of Cluster manager as long it has
executor process that can communicate with each other
The driver program must listen for and accept incoming
connections from its executors throughout its lifetime
Driver is responsible for scheduling tasks on the cluster
Why Yarn as Cluster Manager?
Microsoft, Cloudera, Hortonworks, IBM and many other are all actively working to impove YARN
YARN allows you to dynamically share and centrally configure the same pool of cluster resources
between all frameworks that run on YARN.
YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against
Kerberized Hadoop clusters and uses secure authentication between its processes.
YARN allows us to have richer resource management policy
• Allows to maximize cluster utilization, fair resource sharing, dynamic pre-emption when running multiple concurrent application and also able to provide
different resource guarantees for Batch and Interactive workload.
[1] http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
SparkSubmit starts and talks to the
ResourceManager for the cluster
The ResourceManager makes a single container
request on behalf of the SparkSubmit
The ApplicationMaster starts running within that
container.
The ApplicationMaster requests subsequent
containers for the Spark Executors from the
ResourceManager are allocated to run tasks for
the application.
For Spark Batch Applications, all the Spark
executor containers and Application master are
freed
For Spark interactive Applications (Dynamic
Executor enabled), Spark executors are freed
after idle timeout but Application master
remains till Spark driver exits.
YARN AllocationModel for Spark
https://blog.cloudera.com/blog/2015/09/untangling-apache-hadoop-yarn-part-1/
Running Spark on YARN in HDInsight
• Requirements
• Maximize cluster utilization i.e. reduce idle resource
• Fair resource sharing between different Spark
applications
• Resource guarantee
Maximize cluster utilization
• Reduce allocating idle resource
• Application should be able to use the entire cluster if necessary
• Should be able to work with cluster scaling
• What should be the ideal setting for the number of executors for any Spark
application
• Spark static allocation
 spark.executor.instances to a large value
• Spark dynamic allocation
 spark.dynamicAllocation.enabled = true
 spark.dynamicAllocation.maxExecutors to a large value
• YARN capacity scheduler queue
 yarn.scheduler.capacity.<parent queue>.<child queue>.maximum-capacity
to 100
Fair resource sharing
• Concurrent applications should be able to share resources
• Use separate YARN capacity scheduler queues for different Spark
contexts
 Queues are statically created
 Allocated resources are not shared between different Spark contexts
 Need a way to reclaim allocated resources when another Spark context comes along
 YARN preemption AND Spark dynamic allocation
 Spark dynamic allocation gives up only idle resource
 YARN preemption to reclaim in-use resource
(yarn.resourcemanager.scheduler.monitor.enable &
yarn.resourcemanager.scheduler.monitor.policies)
 YARN preemption predictable with yarn.scheduler.capacity.resource-calculator =
DefaultResourceCalculator
 YARN JIRA YARN-4390
Fair resource sharing
• Use separate Spark resource pools for same Spark context
 Resource pools are dynamically created per context
 Allocated resources are shared between different Spark jobs
 No need to reclaim allocated resources when another Spark job comes along
• Combination of the above to support concurrently running
Notebooks, Batch and BI workloads in the same cluster
Resource guarantee
• Every spark application should be able to run immediately
• Combination
 Separate YARN capacity queues with yarn.scheduler.capacity.<parent queue>.<child
queue>.capacity used to guarantee resources for different Spark applications
 Separate Spark resource pools within the same Spark application
 YARN preemption to ensure that in-use resource can be reclaimed
 Spark dynamic allocation to ensure that idle resources can be reclaimed
Working configuration
• Spark settings
 Spark.executor.instances = <very large value>
 OR
 Spark.dynamicAllocation.enabled = true
 Spark.dynamicAllocation.initialExecutors = 0
 Spark.dynamicAllocation.minExecutors = 0
 Spark.dynamicAllocation.maxExecutors = <very large value>
• YARN settings
 yarn.resourcemanager.scheduler.monitor.enable = true
 yarn.resourcemanager.scheduler.monitor.policies =
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionP
olicy
 yarn.scheduler.capacity.resource-
calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
 yarn.scheduler.capacity.root.queues=default,<n queues>
 Yarn.scheduler.capacity.<parent_queue>.<child_queue>.capacity
 Yarn.scheduler.capacity.<parent_queue>.<child_queue>.maximum_capacity
DEMO

More Related Content

What's hot

Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark Summit
 
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Amazon Web Services
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data Success
Cloudera, Inc.
 
Big Data Analytics with Spark
Big Data Analytics with SparkBig Data Analytics with Spark
Big Data Analytics with Spark
DataStax Academy
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Amazon Web Services
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
cdmaxime
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Jeremy Beard
 
Resource Management in Impala - StampedeCon 2016
Resource Management in Impala - StampedeCon 2016Resource Management in Impala - StampedeCon 2016
Resource Management in Impala - StampedeCon 2016
StampedeCon
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
Amazon Web Services
 
What's new in Hadoop Yarn- Dec 2014
What's new in Hadoop Yarn- Dec 2014What's new in Hadoop Yarn- Dec 2014
What's new in Hadoop Yarn- Dec 2014
InMobi Technology
 
Best Practices for running the Oracle Database on EC2 webinar
Best Practices for running the Oracle Database on EC2 webinarBest Practices for running the Oracle Database on EC2 webinar
Best Practices for running the Oracle Database on EC2 webinar
Tom Laszewski
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
Amazon Web Services
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Cloudera, Inc.
 
Spark Integration Architecture for restaurant data
Spark Integration Architecture for restaurant data�Spark Integration Architecture for restaurant data�
Spark Integration Architecture for restaurant data
David Tung
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
DataWorks Summit
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel Aviv
Amazon Web Services
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
hadooparchbook
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
Amazon Web Services
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
 

What's hot (20)

Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
 
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data Success
 
Big Data Analytics with Spark
Big Data Analytics with SparkBig Data Analytics with Spark
Big Data Analytics with Spark
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Resource Management in Impala - StampedeCon 2016
Resource Management in Impala - StampedeCon 2016Resource Management in Impala - StampedeCon 2016
Resource Management in Impala - StampedeCon 2016
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 
What's new in Hadoop Yarn- Dec 2014
What's new in Hadoop Yarn- Dec 2014What's new in Hadoop Yarn- Dec 2014
What's new in Hadoop Yarn- Dec 2014
 
Best Practices for running the Oracle Database on EC2 webinar
Best Practices for running the Oracle Database on EC2 webinarBest Practices for running the Oracle Database on EC2 webinar
Best Practices for running the Oracle Database on EC2 webinar
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
 
Spark Integration Architecture for restaurant data
Spark Integration Architecture for restaurant data�Spark Integration Architecture for restaurant data�
Spark Integration Architecture for restaurant data
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel Aviv
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 

Viewers also liked

Big data streaming with Apache Spark on Azure
Big data streaming with Apache Spark on AzureBig data streaming with Apache Spark on Azure
Big data streaming with Apache Spark on Azure
Willem Meints
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
Sascha Dittmann
 
Azure api app métricas com application insights
Azure api app métricas com application insightsAzure api app métricas com application insights
Azure api app métricas com application insights
Nicolas Takashi
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
hadooparchbook
 
Go Serverless with Azure Functions
Go Serverless with Azure FunctionsGo Serverless with Azure Functions
Go Serverless with Azure Functions
Jim O'Neil
 
Microsoft NYC 14
Microsoft NYC 14Microsoft NYC 14
Microsoft NYC 14
SwitchPitch
 
Belgian Windows Server 2012 Launch windows azure insights for the enterprise ...
Belgian Windows Server 2012 Launch windows azure insights for the enterprise ...Belgian Windows Server 2012 Launch windows azure insights for the enterprise ...
Belgian Windows Server 2012 Launch windows azure insights for the enterprise ...
Mike Martin
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Paco Nathan
 
Azure IOT
Azure IOTAzure IOT
Software scope
Software scopeSoftware scope
Software scope
Shubham Dubey
 
Going serverless
Going serverlessGoing serverless
Going serverless
TechExeter
 
Azure HDInsight
Azure HDInsightAzure HDInsight
Azure HDInsight
Koray Kocabas
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
Ruhani Arora
 
2016-08-25 TechExeter - going serverless with Azure
2016-08-25 TechExeter - going serverless with Azure2016-08-25 TechExeter - going serverless with Azure
2016-08-25 TechExeter - going serverless with Azure
Steve Lee
 
Spark on Azure HDInsight - spark meetup seattle
Spark on Azure HDInsight - spark meetup seattleSpark on Azure HDInsight - spark meetup seattle
Spark on Azure HDInsight - spark meetup seattle
Judy Nash
 
Open up to a better learning ecosystem
Open up to a better learning ecosystemOpen up to a better learning ecosystem
Open up to a better learning ecosystem
Katie Bradford
 
Azure functions
Azure functionsAzure functions
Azure functions
vivek p s
 
Azure IoT Hub on a Toradex Colibri VF61 – Part 1 - Sending data to the cloud
Azure IoT Hub on a Toradex Colibri VF61 – Part 1 - Sending data to the cloudAzure IoT Hub on a Toradex Colibri VF61 – Part 1 - Sending data to the cloud
Azure IoT Hub on a Toradex Colibri VF61 – Part 1 - Sending data to the cloud
Toradex
 
Microsoft Azure For Solutions Architects
Microsoft Azure For Solutions ArchitectsMicrosoft Azure For Solutions Architects
Microsoft Azure For Solutions Architects
Roy Kim
 
Going serverless
Going serverlessGoing serverless
Going serverless
Jeremy Green
 

Viewers also liked (20)

Big data streaming with Apache Spark on Azure
Big data streaming with Apache Spark on AzureBig data streaming with Apache Spark on Azure
Big data streaming with Apache Spark on Azure
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
 
Azure api app métricas com application insights
Azure api app métricas com application insightsAzure api app métricas com application insights
Azure api app métricas com application insights
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
Go Serverless with Azure Functions
Go Serverless with Azure FunctionsGo Serverless with Azure Functions
Go Serverless with Azure Functions
 
Microsoft NYC 14
Microsoft NYC 14Microsoft NYC 14
Microsoft NYC 14
 
Belgian Windows Server 2012 Launch windows azure insights for the enterprise ...
Belgian Windows Server 2012 Launch windows azure insights for the enterprise ...Belgian Windows Server 2012 Launch windows azure insights for the enterprise ...
Belgian Windows Server 2012 Launch windows azure insights for the enterprise ...
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Azure IOT
Azure IOTAzure IOT
Azure IOT
 
Software scope
Software scopeSoftware scope
Software scope
 
Going serverless
Going serverlessGoing serverless
Going serverless
 
Azure HDInsight
Azure HDInsightAzure HDInsight
Azure HDInsight
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
 
2016-08-25 TechExeter - going serverless with Azure
2016-08-25 TechExeter - going serverless with Azure2016-08-25 TechExeter - going serverless with Azure
2016-08-25 TechExeter - going serverless with Azure
 
Spark on Azure HDInsight - spark meetup seattle
Spark on Azure HDInsight - spark meetup seattleSpark on Azure HDInsight - spark meetup seattle
Spark on Azure HDInsight - spark meetup seattle
 
Open up to a better learning ecosystem
Open up to a better learning ecosystemOpen up to a better learning ecosystem
Open up to a better learning ecosystem
 
Azure functions
Azure functionsAzure functions
Azure functions
 
Azure IoT Hub on a Toradex Colibri VF61 – Part 1 - Sending data to the cloud
Azure IoT Hub on a Toradex Colibri VF61 – Part 1 - Sending data to the cloudAzure IoT Hub on a Toradex Colibri VF61 – Part 1 - Sending data to the cloud
Azure IoT Hub on a Toradex Colibri VF61 – Part 1 - Sending data to the cloud
 
Microsoft Azure For Solutions Architects
Microsoft Azure For Solutions ArchitectsMicrosoft Azure For Solutions Architects
Microsoft Azure For Solutions Architects
 
Going serverless
Going serverlessGoing serverless
Going serverless
 

Similar to Spark in yarn managed multi-tenant clusters

Apache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in HadoopApache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in Hadoop
Cloudera Japan
 
Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Spark & Yarn better together 1.2
Spark & Yarn better together 1.2
Jianfeng Zhang
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Mac Moore
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
DataWorks Summit
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMPowering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
Alluxio, Inc.
 
Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2
DataWorks Summit
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
hdhappy001
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
Bikas Saha
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Hortonworks
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
Amazon Web Services
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Databricks
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN Applications
Hortonworks
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
Joey Echeverria
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services
 
Apache spark
Apache sparkApache spark
Apache spark
Prashant Pranay
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
Xuan-Chao Huang
 
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared ClustersMercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
DataWorks Summit
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
Adarsh Pannu
 

Similar to Spark in yarn managed multi-tenant clusters (20)

Apache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in HadoopApache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in Hadoop
 
Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Spark & Yarn better together 1.2
Spark & Yarn better together 1.2
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMPowering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
 
Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN Applications
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Apache spark
Apache sparkApache spark
Apache spark
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared ClustersMercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 

Recently uploaded

8 things to know before you start to code in 2024
8 things to know before you start to code in 20248 things to know before you start to code in 2024
8 things to know before you start to code in 2024
ArianaRamos54
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
eoxhsaa
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
Vineet
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
yuvarajkumar334
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
tzu5xla
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
Timothy Spann
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
22ad0301
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
ugydym
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
vasanthatpuram
 

Recently uploaded (20)

8 things to know before you start to code in 2024
8 things to know before you start to code in 20248 things to know before you start to code in 2024
8 things to know before you start to code in 2024
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
 

Spark in yarn managed multi-tenant clusters

  • 1. Spark in YARN-managed multi-tenant clusters Pravin Mittal (pravinm@Microsoft.com) Rajesh Iyer (riyer@Microsoft.com)
  • 2. Spark on Azure HDInsight Fully Managed Service • 100% open source Apache Spark and Hadoop bits • Latest releases of Spark • Fully supported by Microsoft and Hortonworks • 99.9% Azure Cloud SLA; 24/7 Managed Service • Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC Optimized for experimentation and development • Jupyter Notebooks (scala, python, automatic data visualizations) • IntelliJ plugin (job submission, remote debugging) • ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc
  • 3. Make Spark Simple - Integrated with Azure Ecosystem • Microsoft R Server - Multi-threaded math libraries and transparent parallelization in R Server means handling up to 1000x more data and up to 50x faster speeds than open source R. This is based on open source R, it does not require any change to R scripts • Azure Data Lake Store – HDFS for the cloud, optimized for massive throughput, Ultra-high capacity, Low Latency, Secure ACL support • Azure Data Factory orchestrates Spark ETL pipeline • PowerBI connector for Spark for rich visualization.Newin Power BI is a streaming connector allowing you to publish real-time events from Spark Streaming directly to Power BI. • EventsHub connector as a data source for Spark streaming • Azure SQL Datawarehouse & Hbase connector for fast & scalable storage
  • 4. Jupyter-Spark Integration via Livy • Sparkmagic is an open source library that Microsoft is incubating under the Jupyter Incubator program • Thousands of Spark clusters in production providing feedback to further improve the experience https://github.com/jupyter-incubator/sparkmagic
  • 5. Spark Execution Model Each Spark Application is an instance of SparkContext that gets its own executor processes that has application lifetime Spark is agnostic of Cluster manager as long it has executor process that can communicate with each other The driver program must listen for and accept incoming connections from its executors throughout its lifetime Driver is responsible for scheduling tasks on the cluster
  • 6. Why Yarn as Cluster Manager? Microsoft, Cloudera, Hortonworks, IBM and many other are all actively working to impove YARN YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes. YARN allows us to have richer resource management policy • Allows to maximize cluster utilization, fair resource sharing, dynamic pre-emption when running multiple concurrent application and also able to provide different resource guarantees for Batch and Interactive workload. [1] http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
  • 7. SparkSubmit starts and talks to the ResourceManager for the cluster The ResourceManager makes a single container request on behalf of the SparkSubmit The ApplicationMaster starts running within that container. The ApplicationMaster requests subsequent containers for the Spark Executors from the ResourceManager are allocated to run tasks for the application. For Spark Batch Applications, all the Spark executor containers and Application master are freed For Spark interactive Applications (Dynamic Executor enabled), Spark executors are freed after idle timeout but Application master remains till Spark driver exits. YARN AllocationModel for Spark https://blog.cloudera.com/blog/2015/09/untangling-apache-hadoop-yarn-part-1/
  • 8. Running Spark on YARN in HDInsight • Requirements • Maximize cluster utilization i.e. reduce idle resource • Fair resource sharing between different Spark applications • Resource guarantee
  • 9. Maximize cluster utilization • Reduce allocating idle resource • Application should be able to use the entire cluster if necessary • Should be able to work with cluster scaling • What should be the ideal setting for the number of executors for any Spark application • Spark static allocation  spark.executor.instances to a large value • Spark dynamic allocation  spark.dynamicAllocation.enabled = true  spark.dynamicAllocation.maxExecutors to a large value • YARN capacity scheduler queue  yarn.scheduler.capacity.<parent queue>.<child queue>.maximum-capacity to 100
  • 10. Fair resource sharing • Concurrent applications should be able to share resources • Use separate YARN capacity scheduler queues for different Spark contexts  Queues are statically created  Allocated resources are not shared between different Spark contexts  Need a way to reclaim allocated resources when another Spark context comes along  YARN preemption AND Spark dynamic allocation  Spark dynamic allocation gives up only idle resource  YARN preemption to reclaim in-use resource (yarn.resourcemanager.scheduler.monitor.enable & yarn.resourcemanager.scheduler.monitor.policies)  YARN preemption predictable with yarn.scheduler.capacity.resource-calculator = DefaultResourceCalculator  YARN JIRA YARN-4390
  • 11. Fair resource sharing • Use separate Spark resource pools for same Spark context  Resource pools are dynamically created per context  Allocated resources are shared between different Spark jobs  No need to reclaim allocated resources when another Spark job comes along • Combination of the above to support concurrently running Notebooks, Batch and BI workloads in the same cluster
  • 12. Resource guarantee • Every spark application should be able to run immediately • Combination  Separate YARN capacity queues with yarn.scheduler.capacity.<parent queue>.<child queue>.capacity used to guarantee resources for different Spark applications  Separate Spark resource pools within the same Spark application  YARN preemption to ensure that in-use resource can be reclaimed  Spark dynamic allocation to ensure that idle resources can be reclaimed
  • 13. Working configuration • Spark settings  Spark.executor.instances = <very large value>  OR  Spark.dynamicAllocation.enabled = true  Spark.dynamicAllocation.initialExecutors = 0  Spark.dynamicAllocation.minExecutors = 0  Spark.dynamicAllocation.maxExecutors = <very large value> • YARN settings  yarn.resourcemanager.scheduler.monitor.enable = true  yarn.resourcemanager.scheduler.monitor.policies = org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionP olicy  yarn.scheduler.capacity.resource- calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator  yarn.scheduler.capacity.root.queues=default,<n queues>  Yarn.scheduler.capacity.<parent_queue>.<child_queue>.capacity  Yarn.scheduler.capacity.<parent_queue>.<child_queue>.maximum_capacity
  • 14. DEMO

Editor's Notes

  1. Sparkmagic is an open source library that Microsoft is incubating under the Jupyter Incubator program Sparkmagic uses Livy, a REST server for Spark, to remotely execute all user code. The library then automatically collects the output of your code as plain text or a JSON document, displaying the results to you as formatted text or as a Pandas dataframe as appropriate. HDInsight users are deploying thousands of Spark clusters in production providing feedback that is helping us refine the user experience further
  2. Application – Spark program consist of driver program and executors on the cluster Driver program - The process running the main() function of the application and creating the SparkContext Executors - A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them Job - A parallel computation consisting of multiple tasks that gets spawned in response to Spark action