YARN Federation
(YARN-2915)
Subru Krishnan, Kishore Chaliparambil,
Carlo Curino, and Giovanni Fumarola
Microsoft
Who are we?
Large team:
• Cloud and Information Services Lab (CISL)
• Applied research group in large-scale systems and machine learning
• BigData Resource Management team
• Design, build and operate Microsoft’s big data infrastructure
Agenda
• YARN @MS
• Federation Architecture
• Policy space
• Demo
YARN @MS
Familiar Challenges:
• Diverse workloads (batch, interactive, services,…)
• Support for production SLAs
• ROI on cluster investments (utilization)
Special Challenges:
• Leverage existing strong infrastructure (Cosmos/Scope/REEF/Azure)
• Enable all OSS technologies
• Scale of first-party clusters (each can exceed 50k nodes)
• Public Cloud (security, number of tenants, service integration…)
Big Bet: Unified Resource Management through YARN (OSS)
[Figure: logos joined by "+", including Azure]
YARN @MS: Innovate and Contribute
Problems
• Lack of SLAs for production jobs
• High utilization for a broad range of workloads
• YARN scalability
• Private cloud (from disjoint clusters)
• Cross-DC?
Our Solution…
• Rayon: resource reservation framework (YARN-1051)
• Mercury: introduce container types and node-level queueing (YARN-2877)
• Federation: “federate” multiple YARN clusters (YARN-2915)
YARN Federation in Apache
• Umbrella JIRA: YARN-2915
• Includes detailed design proposal and e2e patch
• Federation branch created and API patches posted
• You are welcome to join and contribute
• Thanks: Wangda, Karthik, Vinod, Jian, …
Next
YARN Federation Architecture
by Kishore Chaliparambil
YARN Federation
• Enables applications to scale to hundreds of thousands of nodes
• YARN Resource Manager (RM) is a single instance
• Scalability of RM is affected by:
• Cardinality: |nodes|, |apps|, |tasks|
• Frequency: NM and AM heartbeat intervals, task duration
• YARN is battle-tested on 4-8k nodes
• @Microsoft: >50k node clusters, short-lived tasks
• So how does federation work?
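A rough way to see why cardinality and frequency both matter is to estimate the heartbeat rate a single RM has to absorb. The sketch below is a back-of-envelope model only: the function name, the one-second intervals, and the app counts are assumptions made for illustration, and it ignores that an application spanning several sub-clusters may heartbeat to more than one RM.

```python
def rm_heartbeats_per_second(nodes, nm_interval_s, running_apps, am_interval_s):
    """Rough RM load: every NM and every AM sends one heartbeat per interval."""
    return nodes / nm_interval_s + running_apps / am_interval_s


# Hypothetical figures, chosen only to show how load scales with cluster size.
single_rm = rm_heartbeats_per_second(
    nodes=50_000, nm_interval_s=1.0, running_apps=5_000, am_interval_s=1.0
)
per_sub_cluster = rm_heartbeats_per_second(
    nodes=5_000, nm_interval_s=1.0, running_apps=500, am_interval_s=1.0
)

print(single_rm, per_sub_cluster)
```

Under these assumed numbers, splitting the nodes across ten sub-clusters cuts the per-RM heartbeat rate by the same factor, which is the intuition behind federating several smaller clusters.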
[Diagram: YARN Client; Federation Services (Router Service, Policy, State, AM RM Proxy Service per node); YARN Sub-Clusters #1, #2, #3, each with an RM, AMs, and tasks, running on servers in the datacenter]
Federation Architecture
Router Service:
• Implements Client-RM Protocol
• Stateless, Scalable Service
• Multiple Instances with Load Balancer
AM RM Proxy Service:
• Implements AM-RM Protocol
• Hosted in NM
• Intercepts all AM-RM communications
Sub-clusters:
• Unmodified standalone YARN clusters with about 6K nodes
Arrow labels: Submit App, Start Containers
• Voila! Applications can transparently span multiple YARN sub-clusters and scale to datacenter level
• No code change in any application
State:
• Centralized, highly-available repository
• RDBMS, Zookeeper, HDFS,…
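The split between a stateless Router and a centralized state repository can be sketched as follows. This is an illustrative model, not the Hadoop implementation: `StateStore`, `Router`, and `submit` are hypothetical names, and the random choice of home sub-cluster stands in for whatever policy is configured.

```python
import random


class StateStore:
    """Stand-in for the centralized repository (the slides list RDBMS, Zookeeper, HDFS)."""

    def __init__(self, sub_clusters):
        self.sub_clusters = list(sub_clusters)  # registered sub-cluster ids
        self.home = {}                          # application id -> home sub-cluster


class Router:
    """Stateless front end: everything durable lives in the shared StateStore."""

    def __init__(self, store, rng=None):
        self.store = store
        self.rng = rng or random.Random()

    def submit(self, app_id):
        # Reuse an existing mapping so a retry through another Router instance agrees.
        if app_id not in self.store.home:
            self.store.home[app_id] = self.rng.choice(self.store.sub_clusters)
        return self.store.home[app_id]


store = StateStore(["SC1", "SC2", "SC3"])
router_a = Router(store, random.Random(0))
router_b = Router(store, random.Random(1))
home = router_a.submit("app_1")
print(home, router_b.submit("app_1"))
```

Because neither Router instance keeps local state, any number of them can sit behind a load balancer and still return the same home sub-cluster for a given application.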
AM RM Proxy Service Internals
[Diagram: inside the Node Manager, the Application Master talks to the AM RM Proxy Service, which runs a per-application pipeline (interceptor chain) containing a Federation Interceptor, a Security/Throttling Interceptor, and others (…). Other labeled components: Home RM Proxy; Unmanaged AM for SC #2; Unmanaged AM for SC #3; SC #1 RM, SC #2 RM, SC #3 RM; Policy]
• Hosted in NM
• Extensible Design
• DDoS Prevention
• Unmanaged AMs are used for container negotiation. They are created on demand based on policy
• Code Committed to 2.8
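The per-application interceptor chain can be illustrated with a small chain-of-responsibility sketch. All class and field names here are hypothetical, the "RMs" are plain callables, and the rule that an ask without a target goes to the home sub-cluster is an assumption of this example rather than something the slides state.

```python
class Interceptor:
    """One link in the per-application pipeline; forwards to the next link."""

    def __init__(self, next_interceptor=None):
        self.next = next_interceptor

    def allocate(self, request):
        return self.next.allocate(request)


class ThrottlingInterceptor(Interceptor):
    """Rejects oversized heartbeats before they reach any RM."""

    def __init__(self, next_interceptor, max_asks):
        super().__init__(next_interceptor)
        self.max_asks = max_asks

    def allocate(self, request):
        if len(request["asks"]) > self.max_asks:
            raise ValueError("too many asks in one heartbeat")
        return self.next.allocate(request)


class FederationInterceptor(Interceptor):
    """Last link: splits asks by target sub-cluster and merges the answers."""

    def __init__(self, rms, home):
        super().__init__(None)
        self.rms = rms    # sub-cluster id -> callable standing in for that RM
        self.home = home  # asks without a target go to the home sub-cluster (assumption)

    def allocate(self, request):
        by_sc = {}
        for ask in request["asks"]:
            by_sc.setdefault(ask.get("sub_cluster", self.home), []).append(ask)
        containers = []
        for sc, asks in by_sc.items():
            containers.extend(self.rms[sc](asks))
        return containers


def fake_rm(name):
    # Grants every ask and tags it with the RM's sub-cluster name.
    return lambda asks: [f"{name}:{ask['name']}" for ask in asks]


rms = {sc: fake_rm(sc) for sc in ("SC1", "SC2", "SC3")}
chain = ThrottlingInterceptor(FederationInterceptor(rms, home="SC1"), max_asks=10)
granted = chain.allocate(
    {"asks": [{"name": "c1"}, {"name": "c2", "sub_cluster": "SC2"}]}
)
print(granted)
```

The Application Master only ever calls `chain.allocate`, so new links (security checks, metrics) can be added without the application noticing.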
Next
Federation Policies
by Carlo Curino
Federation: Policy Engine
[Diagram: YARN Client; Federation Services (Router Service, Policy, State, AM RM Proxy Service per node); YARN Sub-Clusters #1, #2, #3, each with an RM, on servers in the datacenter; plus a Policy Engine and Federation Admin APIs]
Flexible policies
• Manually curated (to start)
• Automatically generated (later)
General enforcement mechanisms:
• Router
• AMRMProxy
• RM Schedulers
Federation Policies
Goal: efficiently operate a federated cluster
• Complex trade-offs: load balancing, scaling, global invariants (fairness), tenant isolation, fault-tolerance,…
Policies
• Input: user, reservation, queue, node labels, ResourceRequest, …
• State information: sub-clusters load, planned maintenance,…
• Output: routing/scheduling decisions (that determine all container allocations)
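The input/state/output shape of a policy can be sketched as a plain function. This is a toy policy under stated assumptions: `PolicyInput`, `least_loaded_policy`, and the per-queue allow-list are invented for illustration and are not part of any YARN API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyInput:
    """A small subset of the inputs listed on the slide."""
    user: str
    queue: str


def least_loaded_policy(request, load, under_maintenance, allowed_by_queue):
    """Return the sub-cluster to route to.

    load: sub-cluster id -> utilization in [0, 1] (state information)
    under_maintenance: sub-cluster ids to avoid (state information)
    allowed_by_queue: queue -> sub-cluster ids that queue may use (hypothetical)
    """
    allowed = allowed_by_queue.get(request.queue, set(load))
    candidates = [
        sc for sc in sorted(allowed)
        if sc in load and sc not in under_maintenance
    ]
    if not candidates:
        raise RuntimeError("no sub-cluster available for this request")
    return min(candidates, key=lambda sc: load[sc])


load = {"SC1": 0.9, "SC2": 0.4, "SC3": 0.2}
choice = least_loaded_policy(
    PolicyInput(user="alice", queue="A1"),
    load,
    under_maintenance={"SC3"},
    allowed_by_queue={"A1": {"SC1", "SC2", "SC3"}},
)
print(choice)
```

SC3 is the least loaded but is skipped because of planned maintenance, so the decision falls to SC2.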
Tackling hard problems with policies
A hard problem:
How to transparently enable “global queues” via “local enforcement”?
[Diagram: a global queue structure (root R at 100%; queues A, B, C, D at 25% each; A split into A1 at 40% and A2 at 60%) that must be enforced locally on sub-clusters SC1, SC2, SC3, SC4, each marked "?"]
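To make the global queue structure concrete, the absolute share of each leaf queue can be computed by multiplying shares down the tree. The tree below is read off the slide's diagram (R at 100%; A, B, C, D at 25% each; A1 at 40% and A2 at 60% of A); the function name and data layout are assumptions of this sketch.

```python
# Each entry maps a parent queue to its children and their share of the parent.
TREE = {
    "R": {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25},
    "A": {"A1": 0.40, "A2": 0.60},
}


def absolute_shares(tree, root="R", share=1.0, out=None):
    """Walk the tree and record each leaf queue's share of the whole federation."""
    out = {} if out is None else out
    children = tree.get(root)
    if not children:
        out[root] = share  # leaf queue
        return out
    for child, fraction in children.items():
        absolute_shares(tree, child, share * fraction, out)
    return out


shares = absolute_shares(TREE)
for queue, value in shares.items():
    print(f"{queue}: {value:.0%}")
```

So A1 is entitled to 25% × 40% = 10% of the federated capacity and A2 to 15%. The hard part the slide points at is honoring those global numbers when each RM only enforces its own local queue configuration.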
Spectrum of options: Full Partitioning
Policies: Router and AMRMProxy direct to single RM
Pros: perfect scale-out, isolation
Cons: fragmentation/utilization issues, max-size job, uneven impact of SC failures,…
[Diagram: the global queue tree next to SC1-SC4; each sub-cluster hosts a single top-level queue at 100% (A with A1 at 40% and A2 at 60%, B, C, D)]
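Full partitioning reduces both routing decisions to a table lookup. In this sketch the queue-to-sub-cluster assignment and the helper names are hypothetical; only the idea that every request for a queue goes to exactly one RM comes from the slide.

```python
# Hypothetical assignment of top-level queues to sub-clusters.
PARTITION = {"A": "SC1", "B": "SC2", "C": "SC3", "D": "SC4"}
# Leaf queues and their top-level parent, following the queue tree on the slides.
PARENT = {"A1": "A", "A2": "A"}


def home_sub_cluster(queue):
    """Router side: the single sub-cluster that owns this queue."""
    top_level = PARENT.get(queue, queue)
    return PARTITION[top_level]


def amrmproxy_targets(queue):
    """AMRMProxy side: every ResourceRequest goes to that same single RM."""
    return [home_sub_cluster(queue)]


print(home_sub_cluster("A2"), amrmproxy_targets("B"))
```

The lookup also shows where the listed drawbacks come from: a job in queue B can never grow beyond SC2, and losing SC2 takes all of B with it.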
Spectrum of options: Full Replication
Policies: Router (round-robin/random), and AMRMProxy forwards to RMs based on locality of ResourceRequest
Pros: simple, symmetric, fair (if all jobs broadcast demand), resilient
Cons: scalability in #jobs, … (heuristic improvements)
[Diagram: the global queue tree (R at 100%; A, B, C, D at 25% each; A1 at 40%, A2 at 60%) replicated unchanged on each of SC1-SC4]
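With full replication, the Router can spread submissions evenly and the AMRMProxy can split requests by where the requested nodes live. The node names, the node-to-sub-cluster map, and the choice to send requests without locality ("*") to the home sub-cluster are assumptions of this sketch.

```python
import itertools

SUB_CLUSTERS = ["SC1", "SC2", "SC3", "SC4"]
# Hypothetical map from node name to the sub-cluster that manages it.
NODE_TO_SC = {"node-1": "SC1", "node-2": "SC2", "node-3": "SC3", "node-4": "SC4"}

_round_robin = itertools.cycle(SUB_CLUSTERS)


def route_submission():
    """Router side: round-robin over all sub-clusters."""
    return next(_round_robin)


def forward_by_locality(requests, home):
    """AMRMProxy side: group (node, container_count) requests by owning sub-cluster.

    Requests for unknown nodes or "*" (no locality) fall back to the home
    sub-cluster; that fallback is an assumption of this example.
    """
    per_sc = {}
    for node, count in requests:
        sc = NODE_TO_SC.get(node, home)
        per_sc[sc] = per_sc.get(sc, 0) + count
    return per_sc


homes = [route_submission() for _ in range(5)]
split = forward_by_locality([("node-2", 3), ("node-4", 1), ("*", 2)], home="SC1")
print(homes, split)
```

Every application may end up talking to every RM, which is why the slide lists scalability in the number of jobs as the main drawback.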
Spectrum of options: Dynamic Partial Replication
Policies: Router (round-robin/random on subset of RMs), and AMRMProxy forwards to RMs based on locality of ResourceRequest (on subset of RMs)
Pros: trade-off between advantages of replication/partitioning
Cons: complexity / rebalancing (could use a dynamic approach)
[Diagram: the global queue tree next to SC1-SC4; each sub-cluster hosts only a subset of the queues, with local shares rescaled accordingly]
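Partial replication sits between the two extremes: each queue is mapped to a subset of sub-clusters, and that mapping can change over time. The mapping below and all function names are hypothetical, and the fallback to the first sub-cluster in the subset is an arbitrary choice made for this sketch.

```python
import random

# Hypothetical assignment: each top-level queue is replicated on a subset of sub-clusters.
queue_to_scs = {
    "A": ["SC1", "SC2"],
    "B": ["SC3"],
    "C": ["SC2", "SC4"],
    "D": ["SC1", "SC4"],
}


def route(queue, rng):
    """Router side: pick a home sub-cluster at random from the queue's subset."""
    return rng.choice(queue_to_scs[queue])


def forward_target(queue, preferred_sc):
    """AMRMProxy side: honor locality only if the preferred sub-cluster is in the subset."""
    subset = queue_to_scs[queue]
    return preferred_sc if preferred_sc in subset else subset[0]


def rebalance(queue, new_subset):
    """The dynamic part: an operator or controller swaps in a new subset at runtime."""
    if not new_subset:
        raise ValueError("a queue needs at least one sub-cluster")
    queue_to_scs[queue] = list(new_subset)


rng = random.Random(7)
before = route("B", rng)             # B lives on a single sub-cluster at first
rebalance("B", ["SC3", "SC4"])       # later, B is also allowed on SC4
after = forward_target("B", "SC4")
print(before, after)
```

Keeping the subsets small preserves most of the scale-out of partitioning, while the ability to widen a subset addresses fragmentation, at the cost of having to decide when and how to rebalance.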
Demo
Show basic job running across sub-clusters
Show some UIs and ops commands
Showcase user-based, partially-replicated, routing policy
• Router: random-weighted among a set of sub-clusters…
• AMRMProxy: broadcast request to set of sub-clusters…
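The user-based, partially-replicated policy named above can be sketched in a few lines. The policy table, user name, and weights are invented for illustration; `random.choices` from the Python standard library does the weighted pick.

```python
import random

# Hypothetical per-user policy: Router weights plus the set the AMRMProxy broadcasts to.
USER_POLICY = {
    "alice": {
        "weights": {"SC1": 3, "SC2": 1, "SC3": 0},
        "broadcast": ["SC1", "SC2"],
    },
}


def pick_home(user, rng):
    """Router side: random-weighted choice among the user's sub-clusters."""
    weights = USER_POLICY[user]["weights"]
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def broadcast(user, ask):
    """AMRMProxy side: send a copy of the same ask to every sub-cluster in the set."""
    return {sc: dict(ask) for sc in USER_POLICY[user]["broadcast"]}


rng = random.Random(42)
homes = {pick_home("alice", rng) for _ in range(200)}
fanout = broadcast("alice", {"containers": 4, "memory_mb": 2048})
print(sorted(homes), fanout)
```

A weight of zero keeps a sub-cluster out of rotation without removing it from the policy. Broadcasting the full ask to several RMs can over-allocate, so a real policy would need a way to cancel or trim surplus grants; this sketch does not model that.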
Next
YARN Federation Demo
by Giovanni Fumarola

 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 

Recently uploaded (20)

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

YARN Federation

  • 1. YARN Federation (YARN-2915) Subru Krishnan, Kishore Chaliparambil, Carlo Curino, and Giovanni Fumarola Microsoft
  • 2. Who are we? Large team: • Cloud and Information Services Lab (CISL) • Applied research group in large-scale systems and machine learning • BigData Resource Management team • Design, build and operate Microsoft’s big data infrastructure
  • 3. Agenda • YARN @MS • Federation Architecture • Policy space • Demo
  • 4. YARN @MS Familiar Challenges: • Diverse workloads (batch, interactive, services,…) • Support for production SLAs • ROI on cluster investments (utilization) Special Challenges: • Leverage existing strong infrastructure (Cosmos/Scope/REEF/Azure) • Enable all OSS technologies • Scale of first-party clusters (each can exceed 50k nodes) • Public Cloud (security, number of tenants, service integration…) Big Bet: Unified Resource Management through YARN (OSS) + Azure
  • 5. YARN @MS: Innovate and Contribute Problems • Lack of SLAs for production jobs • High utilization for a broad range of workloads • YARN scalability • Private cloud (from disjoint clusters) • Cross-DC? Our Solution… • Rayon: resource reservation framework (YARN-1051) • Mercury: introduce container types and node-level queueing (YARN-2877) • Federation: “federate” multiple YARN clusters (YARN-2915)
  • 6. YARN Federation in Apache • Umbrella JIRA: YARN-2915 • Includes detailed design proposal and e2e patch • Federation branch created and API patches posted • You are welcome to join and contribute • Thanks: Wangda, Karthik, Vinod, Jian….
  • 7. Next YARN Federation Architecture by Kishore Chaliparambil
  • 8. YARN Federation • Enables applications to scale to hundreds of thousands of nodes • YARN Resource Manager (RM) is a single instance • Scalability of RM is affected by • Cardinality: |nodes|, |apps|, |tasks| • Frequency: NM and AM heartbeat intervals, task duration • YARN is battle-tested on 4-8k nodes • @Microsoft: >50k node clusters, short-lived tasks • So how does federation work?
  • 9. Federation Architecture [Diagram: a YARN client submits apps to the Router Service; AMs and tasks run on servers in YARN sub-clusters #1, #2 and #3, each with its own RM; federation services are the Router Service, the per-node AM RM Proxy Service, and the Policy and State stores; arrows labeled Submit App and Start Containers] • Router Service: implements the Client-RM protocol; stateless, scalable service; multiple instances with a load balancer • AM RM Proxy Service (per node): implements the AM-RM protocol; hosted in the NM; intercepts all AM-RM communications • Policy and State: centralized, highly available repository (RDBMS, Zookeeper, HDFS,…) • Sub-clusters are unmodified standalone YARN clusters with about 6K nodes • Voila! Applications can transparently span multiple YARN sub-clusters and scale to datacenter level • No code change in any application
  • 10. AM RM Proxy Service Internals [Diagram: inside the Node Manager, the Application Master talks to the AM RM Proxy Service, which runs a per-application pipeline (interceptor chain): Federation Interceptor, Security/Throttling Interceptor, …; the chain reaches the SC #1 RM through a Home RM Proxy, and the SC #2 and SC #3 RMs through Unmanaged AMs] • Hosted in NM • Extensible design • DDoS prevention • Unmanaged AMs are used for container negotiation; they are created on demand based on policy • Code committed to 2.8
  • 12. Federation: Policy Engine [Diagram: the federation architecture (YARN client, Router Service, per-node AM RM Proxy Service, Policy and State stores, sub-clusters #1 to #3 with their RMs) extended with a Policy Engine and Federation Admin APIs] Flexible policies • Manually curated (to start) • Automatically generated (later) General enforcement mechanisms: • Router • AMRMProxy • RM Schedulers
  • 13. Federation Policies Goal: efficiently operate a federated cluster • Complex trade-offs: load balancing, scaling, global invariants (fairness), tenant isolation, fault tolerance,… Policies • Input: user, reservation, queue, node labels, ResourceRequest, … • State information: sub-cluster load, planned maintenance,… • Output: routing/scheduling decisions (that determine all container allocations)
  • 14. Tackling hard problems with policies A hard problem: how to transparently enable “global queues” via “local enforcement”? [Figure: a global queue structure, root R (100%) with queues A, B, C and D at 25% each and A split into A1 (40%) and A2 (60%), to be enforced locally in sub-clusters SC1 to SC4, whose local queue structures are shown as “?”]
  • 15. Spectrum of options: Full Partitioning Policies: Router and AMRMProxy direct to a single RM Pros: perfect scale-out, isolation Cons: fragmentation/utilization issues, max-size job, uneven impact of SC failures,… [Figure: the global queue structure (R; A, B, C, D at 25% each; A1 40%, A2 60%) partitioned across SC1 to SC4, with each sub-cluster serving its own queue at 100% locally, e.g. SC1 serves A (A1 40%, A2 60%) and SC2 serves B]
  • 16. Spectrum of options: Full Replication Policies: Router (round-robin/random), and AMRMProxy forwards to RMs based on locality of ResourceRequest Pros: simple, symmetric, fair (if all jobs broadcast demand), resilient Cons: scalability in #jobs, … → (heuristics improvements) [Figure: the full global queue structure (R; A, B, C, D at 25% each; A1 40%, A2 60%) replicated in each of SC1 to SC4]
  • 17. Spectrum of options: Dynamic Partial Replication Policies: Router (round-robin/random on a subset of RMs), and AMRMProxy forwards to RMs based on locality of ResourceRequest (on a subset of RMs) Pros: trade-off between advantages of replication/partitioning Cons: complexity / rebalancing → could use dynamic approach [Figure: each of SC1 to SC4 hosts a subset of the global queues (R; A, B, C, D; A1, A2) with locally adjusted shares]
  • 18. Demo Show basic job running across sub-clusters Show some UIs and ops commands Showcase user-based, partially-replicated, routing policy • Router: random-weighted among a set of sub-clusters… • AMRMProxy: broadcast request to set of sub-clusters…
  • 19. Next YARN Federation Demo by Giovanni Fumarola
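
The user-based, partially replicated routing policy showcased in the demo (slide 18) can be illustrated with a minimal sketch. It assumes a per-user policy is just a mapping from sub-cluster id to a positive weight; the function names `route_submission` and `broadcast_request` and the ids `SC1`/`SC2` are hypothetical and are not taken from the YARN code base. The Router side picks a home sub-cluster at random in proportion to the weights, and the AMRMProxy side sends the same request to every sub-cluster in the user's set.

```python
import random


def route_submission(policy, rng):
    """Router side (sketch): pick a home sub-cluster for a new application.

    `policy` maps hypothetical sub-cluster ids to positive weights; the choice
    is random, with probability proportional to each weight.
    """
    sub_clusters = sorted(policy)
    weights = [policy[sc] for sc in sub_clusters]
    return rng.choices(sub_clusters, weights=weights, k=1)[0]


def broadcast_request(policy, containers):
    """AMRMProxy side (sketch): broadcast the same container request to every
    sub-cluster in the user's set, returned as {sub_cluster_id: containers}."""
    return {sc: containers for sc in sorted(policy)}


if __name__ == "__main__":
    # Hypothetical policy for one user: SC1 is three times as likely as SC2.
    user_policy = {"SC1": 3.0, "SC2": 1.0}
    rng = random.Random(42)
    print("home sub-cluster:", route_submission(user_policy, rng))
    print("broadcast:", broadcast_request(user_policy, containers=4))
```

Restricting the policy to a subset of sub-clusters per user is what makes the replication partial: both the Router and the AMRMProxy only ever consider the sub-clusters listed for that user.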