Rohith Sharma, Naganarasimha &
Rohith Sharma K S,
-Hadoop Committer, works for Huawei
-5+ years of experience in the Hadoop ecosystem
Naganarasimha G R,
-Apache Hadoop Contributor for YARN, Huawei
-4+ years of experience in the Hadoop ecosystem
-Apache Hadoop Contributor for YARN and MapReduce
-3+ years of experience in the Hadoop ecosystem
➔ Overview of general cluster deployment
➔ YARN cluster resource configuration walkthrough
● RM Restart/HA
● Queue Planning
Brief Overview: General Cluster Deployment
A sample Hadoop Cluster Layout with HA
RM - Resource Manager
NM - Node Manager
NN - Name Node
DN - Data Node
ATS - Application Timeline Server
ZK - ZooKeeper
YARN Configuration : An Example
Legacy NodeManagers or DataNodes had low resource configurations. Nowadays most systems have high-end
capabilities, and customers want high-end machines with fewer nodes (50~100 nodes) to achieve better
performance.
Sample NodeManager configurations could be like:
-64 GB in Memory
-8/16 cores of CPU
-1Gb Network cards
-100 TB disk (or Disk Arrays)
We focus on this kind of deployment and will try to cover anti-patterns and best practices in the
coming slides.
YARN Configuration: Related to Resources
YARN and MR provide various configurations for tuning resource usage.
● With “vmem-pmem-ratio” (2:1, for example), the NodeManager can kill a container if its virtual
memory usage exceeds twice its configured physical memory.
● It’s advisable to configure “local-dirs” and “log-dirs” on different mount points.
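The virtual-memory check described above comes down to simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the real NodeManager monitors the whole process tree of a container rather than a single number.

```python
def vmem_limit_mb(container_mb: int, vmem_pmem_ratio: float = 2.0) -> float:
    """Virtual-memory ceiling for a container, given its physical allocation.

    vmem_pmem_ratio mirrors the 2:1 example from the slide (an assumption here,
    not necessarily the value configured on a given cluster).
    """
    return container_mb * vmem_pmem_ratio


def exceeds_vmem(container_mb: int, vmem_used_mb: float,
                 vmem_pmem_ratio: float = 2.0) -> bool:
    """True if the container is over its virtual-memory ceiling and may be killed."""
    return vmem_used_mb > vmem_limit_mb(container_mb, vmem_pmem_ratio)


if __name__ == "__main__":
    # A 1 GB container with a 2:1 ratio may use up to 2 GB of virtual memory.
    print(vmem_limit_mb(1024))
    print(exceeds_vmem(1024, 2100))
```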
Container Memory Vs Container Heap Memory
Customer : “Enough container memory is configured, but the job still runs slowly, and sometimes,
when the data is relatively large, tasks fail with OOM.”
1. Container memory and container heap size are different configurations.
2. If mapreduce.map/reduce.memory.mb is configured, make sure to also configure
mapreduce.map/reduce.java.opts for the heap size.
3. Since this was a common mistake by users, trunk now handles this scenario: 0.8 of the container's
configured/requested memory is set as its heap memory.
1. If mapreduce.map/reduce.memory.mb values are specified, but no -Xmx is supplied for the
mapreduce.map/reduce.java.opts keys, then the -Xmx value is derived from the former's value.
2. For these conversions, a scaling factor specified by the property
mapreduce.job.heap.memory-mb.ratio is used (default 80%), to account for overheads between heap usage
vs. actual physical memory usage.
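The heap derivation described above can be sketched as follows. This is a simplified illustration, not the actual MapReduce code: the function names are hypothetical, and rounding the result up to a whole megabyte is an assumption.

```python
import math


def derive_xmx(memory_mb: int, heap_ratio: float = 0.8) -> str:
    """Derive a -Xmx flag from the container memory and the heap ratio.

    heap_ratio stands in for mapreduce.job.heap.memory-mb.ratio (default 80%).
    Rounding up to a whole megabyte is an assumption of this sketch.
    """
    heap_mb = math.ceil(memory_mb * heap_ratio)
    return f"-Xmx{heap_mb}m"


def effective_java_opts(java_opts: str, memory_mb: int,
                        heap_ratio: float = 0.8) -> str:
    """Keep a user-supplied -Xmx; otherwise append one derived from memory_mb."""
    if "-Xmx" in java_opts:
        return java_opts
    return f"{java_opts} {derive_xmx(memory_mb, heap_ratio)}".strip()


if __name__ == "__main__":
    # memory.mb = 2048 with no -Xmx in java.opts: heap is derived from the ratio.
    print(effective_java_opts("-verbose:gc", 2048))
    # An explicit -Xmx is left untouched.
    print(effective_java_opts("-Xmx512m", 2048))
```

The point of the ratio is to leave headroom for non-heap JVM memory (thread stacks, native buffers), so that the process stays within the container's physical limit.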
Shuffle phase is taking a long time
Customer: “A job with 500 GB of data finished in 4 hours, and on the cluster a job with 1000 GB of
data has been running in the reducer phase for 12 hours. I think the job is stuck.”
After enquiring more about the resource configuration:
The same resource configurations were used for both jobs.
1. The job is NOT hung/stuck; rather, the time was spent copying map output.
2. Increase the task resources.
RM Restart : RMStateStore Limit
Customer: “Configured yarn.resourcemanager.max-completed-applications to 100000.
Completed applications in the cluster have reached the limit, and many applications are
running. The observation is that the RM service takes 10-15 seconds to come up.”
1. It is NOT recommended to configure max-completed-applications to 100000.
2. Use the Timeline Server for the history of YARN applications instead.
3. A higher value significantly impacts RM recovery time.
Queue planning : Queue Capacity Planning and Preemption
Queue planning : Queue Capacity Planning for multiple users
Customer : “I have multiple users submitting apps to a queue, and it seems like all the resources
have been taken by a single user’s app(s), even though other apps are activated.”
Queue Capacity Planning :
The Capacity Scheduler (CS) provides options to control the resources used by different users under a queue.
yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent and yarn.scheduler.capacity.<queue-path>.user-limit-factor
are the configurations which determine what amount of resources each user gets.
yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent defaults to 100%, which implies that no user limits are imposed.
It defines the minimum share of the queue's resources each user is going to get.
yarn.scheduler.capacity.<queue-path>.user-limit-factor defaults to 1, which implies that a single user can never take more
than the queue’s configured capacity. It needs to be configured to reflect how much a particular user may take even when
other users are not using the queue.
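How the two settings interact can be illustrated with a simplified model. This is a sketch under stated assumptions, not the Capacity Scheduler's actual algorithm: the function and the usable_mb parameter (the resources the queue can currently draw on, which may exceed its configured capacity when the cluster is idle) are hypothetical.

```python
from typing import Optional


def user_limit_mb(queue_capacity_mb: float,
                  active_users: int,
                  min_user_limit_percent: float = 100.0,
                  user_limit_factor: float = 1.0,
                  usable_mb: Optional[float] = None) -> float:
    """Approximate per-user resource ceiling within one queue (simplified model)."""
    if active_users < 1:
        raise ValueError("active_users must be at least 1")
    if usable_mb is None:
        usable_mb = queue_capacity_mb
    # Each active user gets at least an equal share, or the configured minimum
    # percentage, whichever is larger.
    equal_share = usable_mb / active_users
    minimum_share = usable_mb * min_user_limit_percent / 100.0
    limit = max(equal_share, minimum_share)
    # user-limit-factor caps a single user at a multiple of the queue capacity.
    return min(limit, queue_capacity_mb * user_limit_factor)


if __name__ == "__main__":
    # Defaults (100%, factor 1): every user's ceiling is the whole queue,
    # so the first user to ask can take everything.
    print(user_limit_mb(100_000, active_users=4))
    # With 25%, four active users are held to a quarter of the queue each.
    print(user_limit_mb(100_000, active_users=4, min_user_limit_percent=25))
```

With the defaults, the model reproduces the customer's symptom: nothing stops one user's applications from occupying the whole queue.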
Queue planning : AM Resource Limit
Customer: “Hey buddy, most of my jobs are in the ACCEPTED state and never start to run.
What could be the problem?”
“All my jobs were running fine. But after an RM switchover, a few jobs didn’t resume their work.
Why is the RM not able to allocate new containers to these jobs?”
1. Users need to ensure that the AM Resource Limit is properly configured for their deployment needs.
The maximum resource limit for running AM containers needs to be analyzed and configured correctly to
ensure effective progress of applications.
a. Refer to yarn.scheduler.capacity.maximum-am-resource-percent
2. After an RM switchover, if a few NMs have not registered back, the cluster size can differ from
what it was prior to failover. This affects the AM Resource Limit, and hence fewer AMs
will be activated after restart.
3. For analytical workloads: a higher AM limit. For batch queries: a lower AM limit.
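The effect of cluster size on the AM Resource Limit can be worked through with a little arithmetic. The sketch below is a simplified model with hypothetical function names: it treats the limit as a flat percentage of the memory available to a queue, and the node counts and AM container size are illustrative values only.

```python
def max_am_resource_mb(queue_memory_mb: float,
                       max_am_resource_percent: float = 0.1) -> float:
    """Memory that may be occupied by ApplicationMasters.

    max_am_resource_percent stands in for
    yarn.scheduler.capacity.maximum-am-resource-percent (default 0.1).
    """
    return queue_memory_mb * max_am_resource_percent


def max_concurrent_ams(queue_memory_mb: float, am_container_mb: int,
                       max_am_resource_percent: float = 0.1) -> int:
    """How many AMs of a given size fit under the AM resource limit."""
    limit = max_am_resource_mb(queue_memory_mb, max_am_resource_percent)
    return int(limit // am_container_mb)


if __name__ == "__main__":
    node_mb = 64 * 1024  # 64 GB NodeManagers, as in the sample configuration
    # All 100 NMs registered vs. only 80 registered back after a switchover.
    print(max_concurrent_ams(100 * node_mb, am_container_mb=1536))
    print(max_concurrent_ams(80 * node_mb, am_container_mb=1536))
```

Applications beyond this count stay in the ACCEPTED state until running AMs finish or the limit is raised.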
Queue planning : Application Priority within Queue
Customer : “I have many applications running in my cluster, and a few are very important jobs
which have to execute fast. I now use separate queues to run some very important
applications. The configuration seems very complex here, and I feel cluster resources are not
utilized well because of this.”
Queues: sales (50%), inventory (50%)
The configuration is very complex for this case, and
cluster resources may not be utilized very well.
Suggestion: use Application Priority instead.
Application Priority will be available in YARN from the 2.8 release onwards. A brief heads-up
about this feature:
1. Configure “yarn.cluster.max-application-priority” in yarn-site.xml. This is the maximum
priority that can be configured for any user/application.
2. Within a queue, applications are currently selected using an OrderingPolicy (FIFO/Fair). If
applications are submitted with a priority, the Capacity Scheduler will also consider the priority of
the application in FifoOrderingPolicy. Hence the application with the highest priority will always be
picked first for resource allocation.
3. For MapReduce, use “mapreduce.job.priority” to set the priority.
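The selection order described in step 2 can be pictured as a sort on priority first and submission order second. The sketch below is only a model of that idea, not the scheduler's code: the App class and its field names are hypothetical, and a higher integer is taken to mean a higher priority.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class App:
    app_id: str
    priority: int     # higher value = more important (assumed convention)
    submit_seq: int   # position in submission order; lower = submitted earlier


def allocation_order(apps: List[App]) -> List[App]:
    """Order apps as a priority-aware FIFO policy would consider them."""
    # Highest priority first; ties fall back to plain FIFO order.
    return sorted(apps, key=lambda a: (-a.priority, a.submit_seq))


if __name__ == "__main__":
    queue = [
        App("app_1", priority=0, submit_seq=1),
        App("app_2", priority=5, submit_seq=2),
        App("app_3", priority=5, submit_seq=3),
        App("app_4", priority=1, submit_seq=4),
    ]
    print([a.app_id for a in allocation_order(queue)])
```

This is why a single queue with priorities can replace several small "important jobs" queues: the important applications move to the front without carving up capacity.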
Application Priority within Queue
Resource Request Limits
Customer: “I am not very sure about the capacity of the node managers and the maximum-allocation
resource configuration. But my application is not getting any containers, or it’s getting killed.”
In this case the NMs do not have more than 6 GB of memory. If a container request has a large memory/CPU
demand that is more than any NodeManager’s memory but less than the default “maximum-allocation-mb”, the
RM will not serve the request. Unfortunately, this is not reported to the user as an error, and the
application will wait for allocation indefinitely. Meanwhile, the scheduler will also keep waiting for
some node to be able to meet this heavy resource request.
Use yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores effectively by checking them
against the NodeManager memory/CPU limits.
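A quick sanity check of a request against the NodeManager sizes shows why such a request waits forever. The sketch below uses hypothetical helper names and made-up node sizes; it only compares numbers and does not query a real cluster.

```python
from typing import List, Tuple

# (memory in MB, vcores) offered by each NodeManager; illustrative values.
NodeCapacity = Tuple[int, int]


def can_ever_be_allocated(request_mb: int, request_vcores: int,
                          nodes: List[NodeCapacity]) -> bool:
    """True if at least one node is large enough to host the container."""
    return any(request_mb <= mb and request_vcores <= vcores
               for mb, vcores in nodes)


def suggested_max_allocation(nodes: List[NodeCapacity]) -> Tuple[int, int]:
    """Upper bounds for maximum-allocation-mb / -vcores based on the largest node."""
    return max(mb for mb, _ in nodes), max(vcores for _, vcores in nodes)


if __name__ == "__main__":
    nodes = [(6144, 8), (4096, 4)]  # no NM has more than 6 GB
    # An 8 GB request passes a maximum-allocation-mb of 8192 but fits on no node.
    print(can_ever_be_allocated(8192, 1, nodes))
    print(suggested_max_allocation(nodes))
```

Capping the maximum allocation at the largest node's capacity turns the silent wait into an immediate rejection of the oversized request.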
Customer : “My application has reserved a container on a node and is never able to get new
The reservation feature in the Capacity Scheduler goes a long way towards ensuring better linear resource
allocation. However, a few corner cases are possible. For example, an application has made a reservation
on a node, but this node is running various long-lived containers, so the chance of getting free
resources from this node in the immediate time frame is minimal.
Configurations like the ones below can help make reservations time-bound for effective cluster usage.
●yarn.scheduler.capacity.reservations-continue-look-all-nodes will help in looking for a suitable resource in other