HADOOP 2.0 YARN ARCHITECTURE
Sharad Kumar
Nandan Kumar
Session Objectives
 Introduction to BIG Data and Hadoop
 Understanding Hadoop 2.0 and its features
 Understanding the differences between Hadoop 1.x and Hadoop 2.x
 Understanding YARN
 Working of Application Master
 Scheduling In YARN
 Scheduling Mechanisms in YARN
 Q & A
Introduction to Big Data and Hadoop
Big data is the term for collections of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
Systems and enterprises generate huge amounts of data, from terabytes to petabytes and even zettabytes of information.
It is very difficult to manage data at such a scale, and that is the problem Hadoop addresses.
Big Data and its challenges
The challenges of processing Big Data are the three V's: Volume, Velocity, and Variety.
VOLUME – Modern systems have much more data: terabytes and more per day, petabytes and more in total. We need a new approach.
VELOCITY – We must process such a huge volume of data within a specified time period. We need a new approach.
VARIETY – We have to process different sorts of data: structured, semi-structured, and unstructured. We need a new approach.
What is Hadoop?
Apache Hadoop is a framework that allows the distributed processing of large data
sets across clusters of commodity computers using a simple programming model.
It is an open-source data management technology with scale-out storage and
distributed processing.
Hadoop Ecosystem
Background: Hadoop + HDFS
 Every node contributes part of its local file system to HDFS.
 Tasks can only depend on the local file system (the JVM class path does not understand the HDFS protocol).
[Diagram: the HDFS distributed file system – a NameNode plus DataNodes, each DataNode contributing local file system storage.]
Hadoop 1.x Architecture
Challenges for Hadoop 1.x
 NameNode – no horizontal scalability: a single NameNode and a single namespace, limited by the NameNode's RAM.
 NameNode – no high availability (HA): the NameNode is a single point of failure and needs manual recovery using the Secondary NameNode in case of failure.
 JobTracker – overburdened: it spends a significant amount of time and effort managing the life cycle of applications.
 MRv1 – only Map and Reduce tasks: the huge amount of data stored in HDFS remains underutilized and cannot be used for other workloads such as graph processing.
Hadoop 2.x Features
Property: Hadoop 1.0 → Hadoop 2.0
 Federation: one NameNode and namespace → multiple NameNodes and namespaces
 High Availability: not present → highly available
 YARN (processing control and multi-tenancy): Job Tracker and Task Tracker → Resource Manager, Node Manager, App Master, and Capacity Scheduler
Other Important Hadoop 2.0 Features
 Hadoop Snapshots
 NFSv3 access to data in HDFS
 Support for running Hadoop on MS Windows
 Binary compatibility for MapReduce Applications built on Hadoop 1.0
Hadoop 2.x High Availability
HDFS 1.x vs 2.x
Hadoop MapReduce Classic
 JobTracker
 Manages cluster resources and job scheduling
 TaskTracker
 Per-node agent
 Manages tasks
YARN
 Yet Another Resource Negotiator
 YARN Application Resource Negotiator (a recursive acronym)
 Remedies the scalability shortcomings of “classic” MapReduce
 Classic MapReduce has scalability issues at around 4,000 nodes and beyond
 YARN is more of a general-purpose framework, of which classic MapReduce is one application.
Classic MapReduce vs. YARN
 Fault Tolerance and Availability
 Resource Manager
 No single point of failure – state saved in ZooKeeper
 Application Masters are restarted automatically on RM restart
 Application Master
 Optional failover via application-specific checkpoint
 MapReduce applications pick up where they left off via state saved in HDFS
 Wire Compatibility
 Protocols are wire-compatible
 Old clients can talk to new servers
 Rolling upgrades
Classic MapReduce vs. YARN
 Support for programming paradigms other than MapReduce (multi-tenancy)
• Tez – a generic framework to run a complex DAG
• HBase on YARN (HOYA)
• Machine learning: Spark
• Graph processing: Giraph
• Real-time processing: Storm
• Enabled by allowing paradigm-specific application masters
• All of these run on the same Hadoop cluster!
YARN Architectural Overview
 Scalability – clusters of 6,000–10,000 machines
 Each machine with 16 cores, 48 GB/96 GB RAM, and 24 TB/36 TB of disk
 100,000+ concurrent tasks
 10,000 concurrent jobs
YARN Architectural Overview (contd.)
 Splits up the two major functions of the JobTracker:
 Global Resource Manager – cluster resource management.
 Application Master – job scheduling and monitoring (one per application). The Application Master negotiates resource containers from the Scheduler, tracking their status and monitoring progress. The Application Master itself runs as a normal container.
 Replaces the TaskTracker:
 NodeManager (NM) – a new per-node slave responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting to the Resource Manager.
 YARN maintains compatibility with existing MapReduce applications and users.
YARN Flow
YARN = Yet Another Resource Negotiator
Resource Manager
 Cluster-level resource manager
 Long-lived, runs on high-quality hardware
Node Manager
 One per data node
 Monitors resources on the data node
Application Master
 One per application
 Short-lived
 Manages tasks and scheduling
YARN – How It Works
Protocols:
1.) Client – RM: submit the App Master
2.) RM – NM: start the App Master
3.) AM – RM: request and release containers
4.) AM – NM: start tasks in the allocated containers
[Diagram: a YARN client submits to the Resource Manager; the RM has a Node Manager start the App Master; the AM then requests containers from the RM and has tasks started on the Node Managers.]
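To make steps 1.) and 2.) concrete, here is a minimal client-side sketch using the YarnClient API from org.apache.hadoop.yarn.client.api. It is illustrative only: the application name, queue, container size, and AM launch command are assumptions, and a real client must also stage the AM jar on HDFS as a LocalResource and configure the class path and security tokens.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitAppMaster {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // 1.) Client – RM: ask the Resource Manager for a new application.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app"); // assumed name

    // Describe how to launch the App Master container. The command is a
    // placeholder; real AMs are shipped as LocalResources staged on HDFS.
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
        Collections.emptyMap(),   // local resources (jars on HDFS)
        Collections.emptyMap(),   // environment (class path etc.)
        Collections.singletonList(
            "$JAVA_HOME/bin/java com.example.MyAppMaster 1>stdout 2>stderr"),
        null, null, null);
    appContext.setAMContainerSpec(amContainer);
    appContext.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore
    appContext.setQueue("default");

    // 2.) RM – NM: on submission the RM picks a node and tells that
    // node's Node Manager to start the App Master container.
    ApplicationId appId = yarnClient.submitApplication(appContext);
    System.out.println("Submitted application " + appId);
  }
}
```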
YARN – Application Master
 At this point an Application Master is running in a container.
 Next, the master starts the application's tasks in containers.
 For each task, the procedure is similar to starting the master.
YARN – Application Master (contd.)
1) Connect to the Resource Manager and register itself
2) Loop (see the sketch after this list):
 Send a request
a) Send a heartbeat
b) Request containers
c) Release containers
 Receive the response from the Resource Manager
a) Receive containers
b) Receive notification of containers that terminated
 For each container received, send a request to the Node Manager to start a task.
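Below is a minimal sketch of this loop using the blocking AMRMClient and NMClient APIs (org.apache.hadoop.yarn.client.api). The container count, sizes, placeholder task command, and completion test are assumptions; a production master tracks per-container state and handles failures.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MyAppMaster {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();

    // 1) Connect with the Resource Manager and register.
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(conf);
    rm.start();
    rm.registerApplicationMaster("", 0, ""); // host/port/tracking URL unused

    NMClient nm = NMClient.createNMClient();
    nm.init(conf);
    nm.start();

    // Ask for 3 task containers of 512 MB / 1 vcore (assumed sizes).
    Priority pri = Priority.newInstance(0);
    Resource size = Resource.newInstance(512, 1);
    for (int i = 0; i < 3; i++) {
      rm.addContainerRequest(new ContainerRequest(size, null, null, pri));
    }

    // 2) Loop: allocate() doubles as the heartbeat; it carries new
    //    requests/releases out and newly granted/completed containers back.
    int completed = 0;
    while (completed < 3) {
      AllocateResponse resp = rm.allocate(completed / 3.0f);
      for (Container c : resp.getAllocatedContainers()) {
        // For each container received, ask its Node Manager to start a task.
        ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
            Collections.emptyMap(), Collections.emptyMap(),
            Collections.singletonList("sleep 5"), // placeholder task command
            null, null, null);
        nm.startContainer(c, ctx);
      }
      completed += resp.getCompletedContainersStatuses().size();
      Thread.sleep(1000); // keep heartbeating even with nothing to request
    }

    rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}
```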
YARN – Application Master (contd.)
 The master should terminate after all of its containers have terminated.
• If the master crashes (or fails to send a heartbeat), all of its containers are killed.
• In the future, YARN will support restarting the master.
 Starting a task in a container:
• Very similar to the YARN client starting the master.
• The Node Manager starts and monitors the task.
• If a task crashes, the Node Manager informs the Resource Manager.
• The Resource Manager informs the master in its next response.
• The master must still release the container.
YARN – Application Master (contd.)
 The Resource Manager assigns containers asynchronously
• Requested containers are returned at the earliest in the next allocate call.
• The master must keep sending (possibly empty) requests until it has received all of its containers.
• Subsequent requests are incremental and can ask for additional containers.
• The master must keep track of what it has requested and received.
YARN – Shortcomings
 Complexity
• The protocols are very low level and very verbose.
• The client must stage all dependent jars on HDFS.
• The client must set up the environment, class path, etc.
 Logs are only collected after the application terminates
• What about long-running apps?
 Applications don't survive a master crash
 No built-in communication between containers and masters
 Hard to debug
Scheduling In YARN
 Ideally, the requests that a YARN application makes would be granted immediately.
 In the real world resources are limited, and on a busy cluster an application will often need to wait to have some of its requests fulfilled.
 YARN takes responsibility for providing resources to applications according to defined policies.
 Scheduling is a difficult problem and there is no one “best” policy, which is why YARN provides a choice of schedulers and configurable policies.
Scheduling In YARN
 Scheduler Options (selected cluster-wide, as sketched below)
 FIFO Scheduler
 Capacity Scheduler
 Fair Scheduler
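The scheduler is selected through the yarn.resourcemanager.scheduler.class property in yarn-site.xml. A sketch (the Fair Scheduler is chosen here only as an example):

```xml
<!-- yarn-site.xml: pick one scheduler implementation for the RM.
     Alternatives:
       org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler
       org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```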
Scheduling In YARN – FIFO Scheduler
 FIFO Scheduler: places applications in a queue and runs them in order of submission (first in, first out). Requests for the first application in the queue are allocated first. It is a simple implementation, but it is not suitable for a shared cluster, because large applications consume all the resources while everything else waits in the queue.
Scheduling In YARN – FIFO Scheduler
The FIFO queue scheduler runs jobs based on the order in which the jobs were submitted.
Scheduling In YARN - Capacity Scheduler
 Capacity Scheduler:
 In a shared cluster, each organization is allocated a certain capacity of the overall cluster.
 Each organization is set up with a dedicated queue that is configured to use a given fraction of the cluster capacity.
 Queues may be further divided in a hierarchical fashion, allowing each organization to share its cluster allowance between different groups of users within the organization.
 If there is more than one job in the queue and there are idle resources available, a separate dedicated queue allows a small job to start as soon as it is submitted, although at the cost of overall cluster utilization.
 Sometimes a queue may be allocated resources beyond its specified capacity, but not beyond the maximum capacity of its parent queue. This is called queue elasticity.
Scheduling In YARN - Capacity Scheduler
Scheduling In YARN - Capacity Scheduler
The Capacity Scheduler is a scheduler for Hadoop that allows multiple tenants to securely share a
large cluster. Resources are allocated to each tenant's applications in a way that fully utilizes the
cluster. Free resources can be allocated to any queue beyond its capacity allocation.
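As a sketch of how such a hierarchy is expressed, the Capacity Scheduler reads its queue definitions from capacity-scheduler.xml. The queue names (prod, dev) and percentages below are invented for illustration:

```xml
<!-- capacity-scheduler.xml: two top-level queues under root -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <!-- guaranteed share: 60% of the cluster -->
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>40</value>
</property>
<property>
  <!-- queue elasticity: dev may grow beyond 40%, but never past 75% -->
  <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
  <value>75</value>
</property>
```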
Scheduling In YARN – Fair Scheduler
 Fair Scheduler:
 The Fair Scheduler is also used in shared cluster environments.
 With the Fair Scheduler there is no need to reserve a set amount of capacity, since it dynamically balances resources between all running jobs.
 Just after the first (large) job starts, it is the only job running, so it gets all the resources in the cluster. When a second (small) job starts, it is allocated half of the cluster resources, so that each job uses its fair share.
 After the small job completes and no longer requires resources, the large job goes back to using the full cluster capacity.
 It is a rule/policy-based scheduler: queue configurations are defined in an XML configuration file with a set of rules and policies.
 The configuration can also define preemption, min/max limits, and minimum guaranteed shares.
Scheduling In YARN – Fair Scheduler
Scheduling In YARN – Fair Scheduler
The design goal of the Fair Scheduler is to assign resources to jobs so that each job receives its
fair share of resources over time. The Fair Scheduler enforces fair sharing within each queue.
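A sketch of such a rule/policy file, a Fair Scheduler allocation file (conventionally fair-scheduler.xml); the queue names, weights, and limits are illustrative assumptions:

```xml
<?xml version="1.0"?>
<allocations>
  <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
  <queue name="prod">
    <weight>40</weight>
    <!-- minimum guaranteed share -->
    <minResources>10000 mb,10 vcores</minResources>
  </queue>
  <queue name="dev">
    <weight>60</weight>
    <maxRunningApps>20</maxRunningApps>
  </queue>
  <!-- placement rules: use the requested queue, else the user's primary
       group's queue if it exists, else fall back to dev -->
  <queuePlacementPolicy>
    <rule name="specified" />
    <rule name="primaryGroup" create="false" />
    <rule name="default" queue="dev" />
  </queuePlacementPolicy>
</allocations>
```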
Scheduling Mechanisms In YARN
 Delay Scheduling:
 All YARN schedulers try to honor locality requests.
 On a busy cluster, if an application requests a particular node, there is a good chance that other containers are running on that node and the request cannot be satisfied immediately.
 The obvious course of action is to immediately loosen the locality requirement and allocate a container on some other node (same rack, different rack, different data center).
 However, waiting a short time can significantly increase the chance of being allocated the requested node. This is called delay scheduling, and it is supported by both the Capacity and Fair Schedulers.
 With delay scheduling, the scheduler doesn't simply use the first scheduling opportunity it receives, but waits for up to a given maximum number of scheduling opportunities before loosening the locality constraint and taking the next scheduling opportunity.
 For the Capacity Scheduler, delay scheduling is configured by setting yarn.scheduler.capacity.node-locality-delay (see the sketch below).
 The Fair Scheduler also uses the number of scheduling opportunities to determine the delay, although it is expressed as a proportion of the cluster size. For example, setting yarn.scheduler.fair.locality.threshold.node to 0.5 means the scheduler should wait until half of the nodes in the cluster have presented scheduling opportunities before accepting a container on another node.
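A sketch of the two settings described above (the property names are real; the values are illustrative and depend on cluster size and workload):

```xml
<!-- capacity-scheduler.xml: wait for up to 40 missed scheduling
     opportunities before relaxing the node-locality constraint -->
<property>
  <name>yarn.scheduler.capacity.node-locality-delay</name>
  <value>40</value>
</property>

<!-- yarn-site.xml (Fair Scheduler): wait until 50% of the cluster's nodes
     have presented scheduling opportunities -->
<property>
  <name>yarn.scheduler.fair.locality.threshold.node</name>
  <value>0.5</value>
</property>
```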
Scheduling Mechanisms In YARN
 Dominant Resource Fairness:
 When only a single resource type is being scheduled, such as memory, the notion of capacity or fairness is easy to determine. When multiple resources are in play, things get more complicated.
 The schedulers in YARN address this problem by looking at each user's dominant resource and using it as the measure of cluster usage. This approach is called Dominant Resource Fairness (DRF).
 By default DRF is not used, but it can be enabled per scheduler. For example, the Capacity Scheduler can be configured to use DRF by setting yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator in capacity-scheduler.xml.
 Example:
Imagine a cluster with 100 CPUs and 10 TB of memory. Application A requests containers of (2 CPUs, 300 GB) and application B requests containers of (6 CPUs, 100 GB). A's request is (2%, 3%) of the cluster, so memory is its dominant resource; B's request is (6%, 1%), so CPU is its dominant resource. Comparing the dominant resource requests (3% versus 6%), B's containers are twice as large in DRF terms, so under fair sharing B is allocated half as many containers as A.
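The example's arithmetic as a tiny self-contained sketch (plain arithmetic, not Hadoop's scheduler code; 10 TB is treated as 10,000 GB to keep the slide's round percentages):

```java
public class DominantShare {
  // Cluster capacity from the example: 100 CPUs, 10 TB (~10,000 GB) memory.
  static final double CLUSTER_CPUS = 100, CLUSTER_MEM_GB = 10_000;

  // A request's dominant share is its largest share of any single resource.
  static double dominantShare(double cpus, double memGb) {
    return Math.max(cpus / CLUSTER_CPUS, memGb / CLUSTER_MEM_GB);
  }

  public static void main(String[] args) {
    double a = dominantShare(2, 300); // A: max(2%, 3%) -> 0.03 (memory-bound)
    double b = dominantShare(6, 100); // B: max(6%, 1%) -> 0.06 (CPU-bound)
    // B's containers are twice as large in DRF terms, so under fair
    // sharing B receives half as many containers as A.
    System.out.printf("A=%.2f B=%.2f ratio=%.1f%n", a, b, b / a);
  }
}
```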
Questions