As part of the recent release of Hadoop 2 by the Apache Software Foundation, YARN and MapReduce 2 deliver significant upgrades to scheduling, resource management, and execution in Hadoop.
At their core, YARN and MapReduce 2’s improvements separate cluster resource management capabilities from MapReduce-specific logic. YARN enables Hadoop to share resources dynamically between multiple parallel processing frameworks such as Cloudera Impala, allows more sensible and finer-grained resource configuration for better cluster utilization, and scales Hadoop to accommodate more and larger jobs.
Welcome to Introduction to YARN and MR2. This course is for people who are familiar with Hadoop and MapReduce and want to learn about the new MapReduce 2 architecture. We’re going to talk about the challenges MapReduce 1 has, and why MapReduce 2 and YARN are needed to address those issues. We will talk about the major differences between MR1 and MR2. Then we’ll take a look at how YARN works to manage resources in a cluster…and how a MapReduce job running on YARN actually works. We will take a quick look at how to manage a YARN cluster, and how Cloudera Enterprise users can start moving to the new MR2 platform for their enterprise data hub.
We’ll start with an overview comparison of MR1 and MR2.
If you are a current Hadoop user, MapReduce 1 is what you’ve been using all along – “classic” MapReduce, or MR1. You can think of MR1 as consisting of three main components. First, the API – this is used by programmers to write MapReduce programs…things like mappers, reducers, combiners and all the code that supports MapReduce. Second, the framework – this is the MapReduce “plumbing”…it divides a job into map and reduce tasks and runs those tasks, including calling the input formats, shuffling and sorting the intermediate data and so on. And third, resource management – this is the infrastructure where jobs are submitted to a JobTracker, which then distributes the work to TaskTrackers running on nodes in the cluster, monitors the nodes and the progress of the jobs and so on. MR2 is often called “next generation” or “nextgen” MapReduce, and it is built on top of YARN – which stands for Yet Another Resource Negotiator. YARN takes over the resource management part of MapReduce and has its own API…MapReduce itself now just provides the MapReduce API and framework.
This new architecture was first developed at Yahoo in 2008, and early versions have been available for some time. In version 4 of CDH – Cloudera’s distribution including Apache Hadoop – MR2 and YARN were included as a technology preview, but they weren’t considered ready for prime time yet. This past summer, YARN was promoted to an official sub-project in Apache Hadoop 2. And in October, Hadoop 2 GA was released; as of that release, YARN and MR2 are officially “production ready”. The first beta release of CDH 5 includes this production-ready version.
Now let’s take a look at how this new architecture works, starting with YARN itself.
We’ll start with why we needed YARN in the first place. In MR1, tasks are assigned to nodes according to “slots”, which are either mapper slots or reducer slots. So if a node is configured with 3 mapper slots and 3 reducer slots, and three mappers are already running on that node, no more can run…even if no reducers are running at all and the machine has plenty of CPU power and memory, because the reducer slots aren’t currently running any reducers. Another issue is that other types of applications can run on Hadoop clusters, not just MapReduce. For instance, Impala is a query engine that accesses HDFS directly, not through MapReduce, and Apache Giraph is a graph computation framework. In MR1, there’s no way for these “competing” applications that may be running on the same cluster to know about each other and share resources. Finally, the fact that the JobTracker keeps track of all the tasks on all the nodes limits the size of an MR1 cluster to about 4,000 nodes.
So how does YARN overcome these issues? For starters, with YARN, nodes are no longer configured with specific “slots” for mappers and reducers. Instead, nodes have “resources” – that is, memory and CPU – that are allocated to applications when needed. This means better cluster utilization – unused resources are available for any type of task. Another feature of YARN is that it is a generic platform for resource sharing in a cluster, and isn’t specific to MapReduce. This means other types of applications can run on the same cluster, and YARN will manage resources between all of them. And YARN increases the scalability of Hadoop. In MR1, the JobTracker was a bottleneck. YARN replaces the single JobTracker with two new types of components: a Resource Manager and Application Masters. The Resource Manager has much less responsibility and is therefore not nearly as much of a bottleneck. Let’s take a look at those.
Instead of a JobTracker and TaskTrackers, YARN has a Resource Manager and Node Managers. The Resource Manager runs on a master node, one per cluster. It’s in charge of keeping track of the nodes in the cluster and their resources, and scheduling jobs according to the resources they require. Node Managers are daemons running on each slave node in the cluster. They communicate with the Resource Manager, and run tasks on the node.
When an application runs on YARN, it requests “containers”, which really means a collection of resources…a certain amount of memory and a number of CPU cores. The RM allocates containers on particular nodes, and the application runs in those containers. Containers take the place of the preconfigured “slots” that MR1 uses. The actual application is controlled by an Application Master. Each application – which for MapReduce means each job – has a single app master that requests containers and then runs its tasks in those containers.
So let’s see how this works. This is a basic YARN cluster without any jobs running. The master node runs the RM daemon, and the slave nodes run the NM daemon. Remember that YARN is a generic resource management framework, so we aren’t talking about MapReduce yet – we’ll get to that in a few minutes. This is a concept that applies to any kind of application running on the cluster.
To submit a job, a client passes a “New Application Request” – containing information about the application to run – to the RM. The RM creates a “container” on one of the slave nodes in which to run the Application Master (AM) for that application, and communicates with the Node Manager on that node to launch it. The RM passes an Application ID back to the client.
Once it’s launched and running, the Application Master requests resources from the Resource Manager…that is, it requests “containers” with a certain amount of memory and CPU. The RM allocates the requested number of containers on nodes in the cluster…and passes them to the application.
The Application Master then communicates with the Node Managers on the nodes where the containers are, and instructs them to launch the application’s processes. During execution, the AM communicates with the instances of the app in an application-specific way (not through YARN).
So let’s say one application is running, and now we want to run a second one at the same time.It works the same way: the client tells the resource manager what application to run. The resource manager creates a container on one of the nodes and launches the application master in it.
The second application master requests resources. The Resource Manager knows what resources are available in the cluster, and allocates containers according to resource availability. Then the second application launches whatever it needs in those containers. So that’s a basic overview of how an application runs on a YARN cluster. Let’s take a closer look at a couple of key communication points in this process.
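The bookkeeping we just walked through can be sketched as a toy model. This is purely illustrative Java – not the real ResourceManager code or the org.apache.hadoop.yarn API – but it shows the essential loop: nodes register their resources, container requests are matched against what’s free, and finished containers return their resources to the pool.

```java
import java.util.*;

// Toy model of YARN's allocation loop (illustrative only -- not the real
// ResourceManager API). Each node advertises memory; the "RM" places a
// requested container on the first node with enough free memory.
class ToyResourceManager {
    // node name -> free memory in MB
    private final Map<String, Integer> freeMemory = new LinkedHashMap<>();

    void registerNode(String name, int memoryMb) {
        freeMemory.put(name, memoryMb);           // node manager registers its resources
    }

    // Returns the node the container was placed on, or null if nothing fits
    // (in which case the request waits until resources free up).
    String allocateContainer(int memoryMb) {
        for (Map.Entry<String, Integer> node : freeMemory.entrySet()) {
            if (node.getValue() >= memoryMb) {
                node.setValue(node.getValue() - memoryMb);  // reserve the resources
                return node.getKey();
            }
        }
        return null;
    }

    void releaseContainer(String node, int memoryMb) {
        freeMemory.merge(node, memoryMb, Integer::sum);     // app finished; resources return
    }
}
```

Because both applications go through the same RM, the second app master’s requests are automatically placed wherever capacity remains – which is exactly why two frameworks can share one cluster.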
Now let’s talk about how the Resource Manager actually figures out what resources to allocate, where, and for which app. Just like in MR1, this is done by a scheduler. Different schedulers can be plugged into the Resource Manager to do this. You could create custom schedulers if needed, but two are included, which are very similar to the two used in MR1: the Capacity Scheduler and the Fair Scheduler. If you are familiar with MR1, these function pretty much the same way, except of course they’re more generic, to support any YARN application, not just MapReduce. By far the most common one, and the one Cloudera generally recommends, is the Fair Scheduler. In MR1, the Fair Scheduler worked with “pools” whereas the Capacity Scheduler worked with “queues”, but they did basically the same thing – so to avoid confusion, in YARN both use the same name: “queues”.
One of the big differences between MR1 schedulers and YARN schedulers is support for hierarchical queues. Queues can contain other queues, and share the parent queue’s resources. So let’s say you have two departments using a Hadoop cluster: engineering and marketing. As in MR1, a queue’s “weight” can be set relative to other queues, so in this example engineering gets twice the resources of marketing. Within engineering, they may have two sub-queues for different products, with product A getting more priority than product B. Marketing takes a different approach – they have a queue dedicated to handling their website log ETL process, and another queue for data analysis. Within data analysis, they have a queue for high-priority short-running jobs, and another for regular jobs.
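The weight arithmetic here is simple enough to sketch. This is an illustrative snippet, not the actual FairScheduler implementation; it just shows how a parent queue’s resources divide in proportion to child weights.

```java
import java.util.*;

// Illustrative sketch of fair-scheduler weights (not the real FairScheduler
// code): a parent queue's resources are divided among its children in
// proportion to each child's weight.
class FairShares {
    static Map<String, Integer> shares(Map<String, Integer> weights, int totalMb) {
        int totalWeight = weights.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> q : weights.entrySet()) {
            // each queue's steady-state fair share of the parent's memory
            result.put(q.getKey(), totalMb * q.getValue() / totalWeight);
        }
        return result;
    }
}
```

Applying it twice gives the hierarchical behavior: first split the cluster between engineering (weight 2) and marketing (weight 1), then split engineering’s share again between its product sub-queues.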
So what else does the Resource Manager do, other than running the scheduler? Its tasks include: keeping track of the state of all the nodes in the cluster; allocating containers when applications need them, based on the available resources on nodes – and releasing those containers when they are no longer needed, or when they expire (this is to make sure that applications don’t request resources and then not use them…if a container isn’t used within a certain amount of time, like 10 minutes, the Resource Manager retires it); creating a container for each new Application Master, telling the Node Manager how to launch it, and then tracking all the Application Masters in the cluster; and handling cluster-wide security using Kerberos.
Meanwhile, the Node Manager has different responsibilities. When it starts, it registers with the Resource Manager to report what resources it has available, and it sends heartbeats so that the RM can keep track of it. The Node Manager is also what actually launches and tracks the processes on the node. This includes starting the Application Master itself when the Resource Manager says to, and starting individual processes in containers for Application Masters. The Node Manager also makes sure that processes in containers stay within their allotted resources, and kills processes that use resources out of control. Another thing it does is aggregate logs. If you’ve used MR1, you know that one of the pain points is working with distributed log files. YARN aggregates an application’s logs into an HDFS file which can be accessed by any client. The Node Manager can also run auxiliary services – extra processes a framework might require. We’ll talk soon about how MR2 uses this for shuffle and sort. Finally, Node Managers handle node-level security using ACLs, or access control lists.
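Log aggregation, for example, is controlled by a single switch in yarn-site.xml. (The property name below is the standard Hadoop 2 one; if you use Cloudera Manager it can manage this setting for you.)

```xml
<!-- in yarn-site.xml: have node managers aggregate a finished
     application's logs into a single location in HDFS -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
```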
Let’s look at one of the key points of communication with the Resource Manager: the resource request. As we said, the app master requests whatever resources it needs to run the application from the Resource Manager. In YARN, resources are allocated as containers, so a resource request is essentially a container request. The request includes a resource name, meaning where the container should be located: a specific host, a host in a specific rack, or * for anywhere. It also includes a priority – this is a priority WITHIN the application, not between applications, meaning that a higher-priority container’s resources will be allocated first. And of course, the request includes the amount of resources required. In earlier versions of YARN, that just meant how much memory; now YARN supports the ability to allocate CPU cores as well. There’s work going on to add support for additional types of resources such as disk and network I/O, but for now it’s just memory and CPU. The Resource Manager fulfills the request by allocating containers on nodes with the appropriate resources available, and passes the IDs and node locations of those containers back to the app master.
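A resource request, then, boils down to four pieces of information. Here’s a toy sketch in Java – illustrative only, the real request classes live in org.apache.hadoop.yarn.api.records – including the locality check implied by the resource name:

```java
// Toy model of a YARN resource request (not the real API classes).
// A request names a specific host, a rack, or "*" for anywhere.
class ToyRequest {
    final String resourceName;   // e.g. "node1", "/rack1", or "*"
    final int priority;          // ordering WITHIN this application
    final int memoryMb;          // amount of memory requested
    final int vcores;            // CPU cores requested

    ToyRequest(String resourceName, int priority, int memoryMb, int vcores) {
        this.resourceName = resourceName;
        this.priority = priority;
        this.memoryMb = memoryMb;
        this.vcores = vcores;
    }

    // Can this request be satisfied on the given host in the given rack?
    boolean matches(String host, String rack) {
        return resourceName.equals("*")       // anywhere
            || resourceName.equals(host)      // node-local
            || resourceName.equals(rack);     // rack-local
    }
}
```

A map task would typically name the host holding its HDFS block, while a reducer uses “*” – we’ll see exactly that later in the MapReduce walkthrough.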
Once it’s got that ID and location, the next key communication happens, which is between the Application Master and the Node Manager. This is called a Container Launch Context. This is how the app master tells the Node Manager how to start up the application within the container…it includes the container ID, the actual commands to start the application, the environment information – that is, the configuration – and any local resources required…such as the binary file containing the actual code for the application, references to HDFS files it will be working with, and so on.
Okay, so now we’ve covered how YARN works at a conceptual level, so we’ll move on to talking about MapReduce as one specific type of application that can run on YARN. But remember that YARN is a generic framework and supports a growing number of other types of applications. We won’t get into those in detail in this course, but you should be aware of them. One is DistributedShell. This is a very basic app which is included in Hadoop – it allows running a shell script on multiple nodes. Useful, but mostly it’s a proof of concept for YARN. There are other applications too, like Impala, which does near-real-time queries on data in HDFS using its own query engine rather than MapReduce…or Apache Giraph, which is a distributed engine for graph computation and runs on a Hadoop cluster. Spark is an alternative to MapReduce for stream processing. You can see the list of other applications on the Apache website.
Okay now let’s turn our attention to MapReduce itself.
As we said, YARN is a generic resource management framework. MapReduce 2 is the new version of MapReduce…unlike MR1, it relies on YARN to do the underlying resource management. What’s left is the MapReduce API we already know and love, and the framework for running MapReduce applications. In MapReduce 2, each job is a new “application” from the YARN perspective, so each job has its own Application Master. MR2 includes a component for that called MRAppMaster.
So let’s look at how a MapReduce application works in the context of YARN. Here’s a typical MR cluster. The HDFS setup is the same – either a single NameNode plus a Secondary NameNode, or two NameNodes in HA mode. MR2 does not change HDFS at all. The MR Job History Server is an MR-specific daemon that’s necessary if you are running MR2, because the Resource Manager does not store job history. We will discuss this more later.
As always, MR continues to work on data in HDFS. Imagine we have an HDFS data file that is split into two blocks which are stored on nodes 1 and 2. (We aren’t showing replication here, but the principle would be the same…the blocks would just exist on multiple nodes.)
Let’s look at how the YARN job lifecycle applies to running an MR application like WordCount on that HDFS data file. The programs and the command on the client are exactly the same whether using MR1 or MR2 – nothing about the API or command line interface changes. So in this example we use the familiar hadoop jar command. At that point, the Hadoop client will check the spec of the job, make sure the folders are there, and so on. It will then compute splits. In this example, there are two blocks, therefore two input splits. (By the way, in MR2 this is done by default on the client, but you can actually configure split computation to run on the cluster instead.) The client then copies the job information to HDFS – the application jar file, configuration, and split information, plus anything for the distributed cache. Finally, it submits the job via the YARN Client Protocol’s submitApplication() call, which goes to the Resource Manager. When the RM gets the submitApplication call, it will tell the scheduler to allocate a container for the Application Master, and tell the Node Manager to launch the new container to run the Application Master – which for MapReduce is MRAppMaster.
MRAppMaster then starts. First it will initialize the job – retrieve the configuration and input split information that the Hadoop client placed in HDFS, and set up output directories in HDFS. Then it will set up the tasks required for the job. We already know there are two input splits, and therefore two map tasks. Let’s say this particular job is also set to require two reducers. So the app master will set up 4 tasks – two mappers and two reducers. The app master also decides how to run the tasks. For small jobs, it may run them locally – in the Application Master’s own JVM. This is an “uber task”, for when the overhead of distribution isn’t worth it. What defines a “small” job is configurable; by default it’s a job with fewer than 10 mappers, 1 reducer, and a total input size of less than 1 HDFS block. Uberization is disabled by default, so if you want to take advantage of this you’ll have to enable it. For larger jobs, MRAppMaster distributes the tasks to containers. This will be the case for most jobs. So the AM will request a container for each task, starting with the two map tasks. The default is 1 GB of RAM and one CPU core for each task, but this can be configured on a per-job basis. The request also includes the nodes where the HDFS data is stored, so that the map tasks can run locally if possible, as in MR1.
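The uber decision described above is easy to sketch. This is a simplified illustration, not the real MRAppMaster check (which also consults memory limits); the thresholds match the defaults from the narration – fewer than 10 mappers, at most 1 reducer, input under one HDFS block, and the whole feature off unless explicitly enabled (mapreduce.job.ubertask.enable).

```java
// Simplified sketch of the MRAppMaster "uber task" decision (illustrative;
// the real check also considers the AM's memory limits).
class UberDecision {
    static boolean runAsUber(boolean uberEnabled, int maps, int reduces,
                             long inputBytes, long blockSizeBytes) {
        return uberEnabled                   // disabled by default
            && maps < 10                     // mapreduce.job.ubertask.maxmaps (default 9)
            && reduces <= 1                  // mapreduce.job.ubertask.maxreduces (default 1)
            && inputBytes < blockSizeBytes;  // input smaller than one HDFS block
    }
}
```

For our WordCount example – two maps, two reducers – the job is distributed to containers even if uberization were enabled, because it has more than one reducer.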
The resource manager will allocate the number of containers requested…if possible, on the nodes that were requested or at least in the same rack. The RM will pass the list of container IDs back to the app master
MRAppMaster will launch the tasks in the containers by contacting the Node Managers and passing a launch context. The Node Manager will start a new JVM. (JVM re-use is not supported in YARN.) While running, each task communicates its status back to the app master, which aggregates the status of all the tasks. The client polls the app master (once per second) for the aggregate status updates. And as we will see later, MRAppMaster provides a web UI, so we can access the status information there too.
When the reduce tasks are ready to run, the app master requests a container for each. In our example we happen to have set it to 2 reducers, so we request two containers. In this case, the app master sets the requested node name to ‘*’, indicating that we don’t care which nodes it runs on, because data locality doesn’t apply to reducers, only mappers.
Just as with the map tasks, the resource manager passes IDs with the requested containers.
And the app master will request that the Node Managers launch the reduce tasks.
When the reducers are done, the app master notifies the Resource Manager that the job – that is, the application – is complete. The RM then decommissions the application’s containers (both the task containers and the app master container) and releases the resources to be scheduled for another job.
You might be wondering how the MapReduce “shuffle” process works in MR2, because in MR1 this is managed by the JobTracker and TaskTrackers, which don’t exist in MR2. The answer is what we mentioned earlier: it runs as an auxiliary service in the Node Manager’s JVM. So, as in MR1, a map task reads its data from an HDFS file, and then writes its output to an intermediate file stored locally on the node’s hard disk. When the reducers start up, they need access to that intermediate data, so they request the data from the shuffle service, which is a persistent process running in each Node Manager JVM. They know how to find the service because that information was passed to them by MRAppMaster when they started. The shuffle service passes the requested intermediate data to the reducers. And of course, the reducers write their output to HDFS files just as they did in MR1.
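On a hand-configured cluster, this shuffle auxiliary service is what the aux-services setting in yarn-site.xml enables. (These are the standard Hadoop 2 property names; Cloudera Manager sets them for you.)

```xml
<!-- in yarn-site.xml: register MapReduce's shuffle handler as an
     auxiliary service inside each node manager's JVM -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```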
So now we know how things are *supposed* to happen when a MapReduce job runs on YARN. But what if something goes wrong? That’s where we look at YARN’s fault tolerance. Any of the various components in the lifecycle might fail or become unavailable. First, an individual task might fail – that is, a process running on a node in a container. Handling this is specific to MapReduce, not YARN in general, so it’s the job of the Application Master, which handles it the same way the JobTracker did in MR1: the app master will attempt to re-run the task 4 times, or however many times we’ve configured. We can also configure a certain number of tasks that are allowed to fail; if too many fail, then the whole application is considered to have “failed”.
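That retry-and-threshold policy can be sketched in a few lines of Java. This is a toy illustration of the logic, not the actual MRAppMaster code; the property names in the comments (mapreduce.map.maxattempts and mapreduce.map.failures.maxpercent) are the standard MR2 ones.

```java
// Toy sketch of the MR2 task-failure policy (illustrative only): a task
// gets up to maxAttempts tries, and the job fails if the fraction of
// failed tasks exceeds the allowed percentage.
class ToyFailurePolicy {
    final int maxAttempts;        // mapreduce.map.maxattempts / mapreduce.reduce.maxattempts (default 4)
    final int allowedFailurePct;  // mapreduce.map.failures.maxpercent (default 0)

    ToyFailurePolicy(int maxAttempts, int allowedFailurePct) {
        this.maxAttempts = maxAttempts;
        this.allowedFailurePct = allowedFailurePct;
    }

    // True once the app master should give up re-running this task.
    boolean taskFailed(int attemptsSoFar) {
        return attemptsSoFar >= maxAttempts;
    }

    // True if too many tasks have failed for the job to succeed.
    boolean jobFailed(int failedTasks, int totalTasks) {
        return failedTasks * 100 > allowedFailurePct * totalTasks;
    }
}
```

With the defaults, a single task that exhausts its four attempts fails the whole job; raising the failure percentage lets a job tolerate a few bad tasks.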
What if the Application Master itself fails? Remember that the Resource Manager is keeping track of Application Master heartbeats. If those heartbeats stop, the Resource Manager will restart the whole application, starting a new Application Master, which in turn requests new containers to run its tasks. One optional setting in MRAppMaster is job recovery. If this is enabled, then during processing the app master saves the progress of the various tasks. If the app restarts, it will retrieve that information and only re-run the tasks that were incomplete. For tasks that had already run, the output data that was already saved will be used.
If a node in the cluster fails, the Resource Manager will notice, because it’s tracking heartbeats from the Node Manager daemon, and will stop allocating containers on that node until it’s back up. If any application processes were running on that node, it’s the job of the Application Master to decide how to handle that. As we just mentioned, the MapReduce app master will restart those tasks on another node. What if the node that fails is running the Application Master itself? In that case, the Resource Manager will treat it as a failed application, as just discussed, and attempt to restart it. Finally, the Resource Manager itself can fail. There’s only a single RM active on the cluster at any time – without an RM, no applications or tasks can be launched, and the cluster is essentially unavailable. YARN has a high availability option, which lets you configure a standby Resource Manager that will become active if the active one goes down.
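When that high availability option is configured by hand, it’s driven by a few yarn-site.xml properties along these lines. (This is a sketch, not a complete HA configuration – the exact set of required properties depends on your Hadoop version and on how failover is coordinated.)

```xml
<!-- in yarn-site.xml: enable an active/standby pair of resource managers -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <!-- logical IDs for the two RMs; each ID then gets its own address properties -->
  <value>rm1,rm2</value>
</property>
```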
Now let’s talk about a few things to know as you move toward using MapReduce 2, starting with how to manage a YARN cluster.
The JobTracker and TaskTracker UIs in MR1 have been replaced in MR2 with the Resource Manager and Node Manager UIs, which have very similar functionality. This is the main Resource Manager screen, where you can get general information about the cluster. The overview area shows how many jobs have been submitted, how many are running, and so on – both for the cluster as a whole and for the current user. We can also look specifically at nodes, applications, or the scheduler. Selecting nodes shows us a list of all the nodes in the cluster – in this example we just have two, because it’s a demo, but obviously real clusters will have more. We can get more details about a specific node by clicking on the node name.
We can also look at a list of all the applications, filtered by state – just submitted, currently running, failed, and so on. If we click the application ID…
We can see details about that application like who started it and how long it’s been running. This information is provided by YARN, but remember that YARN doesn’t know what kind of application it is, so it can’t give us any more detail than that. But we can click on the Application Master link…
To view information from the Application Master itself. This UI is provided by the MapReduce app master. It gives us details about the MapReduce job, similar to what we would see in the JobTracker UI in MR1. Remember that one of the main points of YARN is that all the application-specific framework code is moved out of the JobTracker and into the app master – so this UI is provided by the app master itself. It shows us this job has 4 mappers, which are currently about halfway done, and 1 reducer, which hasn’t started yet. If we click on the job ID…
we can get details about the MapReduce job…details about the tasks, counters, configuration…and the application’s logs. As we mentioned before, YARN aggregates the logs for an application into a single HDFS location.
One thing to note is that the Resource Manager does not keep a history of applications that have already run. Applications “expire” after a configurable period of time, and are then no longer available through the Resource Manager UI. But individual application frameworks can keep their own history, and MapReduce 2 does so by adding a Job History Server daemon to the cluster – we mentioned this briefly earlier. It keeps track of all the jobs that have run, and you can see the history by going to the Job History UI. Here you can browse or search through completed jobs and view their metadata, how long they ran, and so on.
So, those are the cluster monitoring UIs provided natively with YARN and MR2. If you are using Cloudera Manager, of course, you can get a lot more insight into the cluster, and configure and manage it from a single location. The latest version of Cloudera Manager includes full support for managing YARN and MR2. For instance, you can configure the YARN daemons and all the various properties, like how many resources are requested by map and reduce tasks…or set up and monitor resource pools – how much memory and how many CPU cores are allocated to applications.
Hue has also been updated to support MR2. You can browse current and recent applications, or check the “show retired jobs” option to connect to the Job History Server and view retired jobs.
The last topic is about how MR2 and Cloudera fit together.
CDH is Cloudera’s distribution including Apache Hadoop. If you are using CDH 4, you already have MR2, because we included early versions of it in CDH 4 starting in 2012. But note that those early versions were not production ready. They were included in CDH as a technology preview so that developers and administrators could start working with the new technology; MR2 in CDH 4 is not recommended for use in production systems. As we mentioned earlier, a production-ready version of MR2 and YARN was released recently, in October, so starting with CDH 5, Cloudera recommends you use MR2. But MR1 is still included and supported as well, so you can make the transition in whatever way makes sense for your organization. Note, though, that you can’t run both on the same cluster: it’s either an MR1 cluster or a YARN cluster, not both.
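Which framework a client submits jobs to is controlled by a single property in mapred-site.xml. (This is the standard Hadoop 2 property name; on an MR1 cluster it is unset or set to its classic value instead.)

```xml
<!-- in mapred-site.xml: submit jobs to YARN (MR2) rather than a JobTracker -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```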
When you are ready to move, you’ll be happy to know that we’ve made sure that programs built for any version of CDH and MapReduce will run with any other version. This means that your MapReduce programs can run without recompilation, as is, whether you are moving to MR2 in CDH 4 or CDH 5. Now, what if you need to do further development, or update your code for unrelated reasons? You’ll need to recompile then, and in 99% of cases all versions are source compatible as well, so you won’t need to make any code changes related to migration. There are a couple of minor instances of source incompatibility, which are listed on our blog; if you happen to use those bits of code, you’ll need to modify them before recompiling, but that shouldn’t affect most users.
Now what about the rest of the Hadoop ecosystem, beyond just MapReduce? Well, in Cloudera Enterprise and Cloudera Standard 4, as we mentioned, CDH included a preview version of MR2, and likewise Cloudera Manager also included preliminary support for MR2. In Cloudera 5, Cloudera Manager has full support for MR2, as we mentioned, as do all the various ecosystem components that depend on MapReduce – such as Hive, Pig, Mahout, and so on. All MapReduce applications are tested and qualified to run on either MR1 or MR2. We’ve also included preliminary support for Impala running on YARN. Remember that one of the advantages of YARN is that applications which don’t use MapReduce but run on a Hadoop cluster can share resources – and Impala is one of those applications.
So that’s YARN and MapReduce 2 in a nutshell.
To re-iterate the key take-away points: MapReduce 2 is the next generation of MapReduce. In MR2, the resource management responsibilities in a cluster are handled by YARN. Instead of using a JobTracker to handle both resource allocation and job management, a single Resource Manager handles resources and scheduling, while a framework-specific Application Master handles all the actual application management. So each MapReduce 2 job has its own MRAppMaster, which handles task assignment and tracking. The advantage of this architecture is that it is more scalable, because it lightens the load on the master node, so Hadoop clusters are no longer limited to 4,000 nodes. It’s more efficient, because resources are allocated according to memory and processing needs as they arise, rather than by designating a certain number of map or reduce slots on each node. And it’s more flexible, because it allows sharing resources between different types of applications running on the same cluster. If you are a Hadoop user or developer, all these changes are pretty much invisible…they “just work”. You will still use the same MapReduce API, the same hadoop command line interface, and the web-based UIs are very similar. And finally, after several years of evolution, MapReduce 2 is ready for prime time. You can start getting familiar with it in CDH 4, and get CDH 5 when you’re ready to move to MapReduce 2 for your production environment.
I hope this presentation has been helpful to you…and if you want to learn more, here are some good places to start. First, check out Chapter 6 in Hadoop: The Definitive Guide by Cloudera’s own Tom White, which focuses on the architecture of MapReduce on top of YARN. And there are several valuable blog posts on the topic on Cloudera’s blog that will help guide you as you start moving to MapReduce 2 and YARN.