As Arun mentioned there are less JVMs to spin up per job management (1 instead of 3) as well as the RM and NM provisioning being fasterOriginally conceived & architected by the team at Yahoo!Arun Murthy created the original JIRA in 2008 and led the PMCThe team at Hortonworks has been working on YARN for 4 years: 90% of code from Hortonworks & Yahoo!YARN based architecture running at scale at Yahoo!Deployed on 35,000 nodes for 6+ monthsMultitude of YARN applications*********************On great public example of in production use of YARN, is at Yahoo!. They outlined some performance gains in a keynote address at Hadoop Summit this year. Yahoo uses YARN for three use cases, stream processing, iterative processing and shared storage. With Storm on YARN they stream data into a cluster and execute 5 second analytics windows. This cluster is only 320 nodes, but is processing 133,000 events per second and is executing 12000 threads. Their shared data cluster uses 1900 nodes to store 2PB of data.In all, Yahoo has over 30000 nodes running YARN across over 365PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time. They also have estimated a 60% – 150% improvement on node usage per day.ANDAt this point, over 50,000 Hadoop nodes have been upgraded at Yahoo from Hadoop 1.0 to Hadoop 2, yielding 50% improvement in cluster utilization & efficiency.This should be a big deal in terms of potential ROI.
Reconfig already works (HDP-1.3 and HDP-2.0).Chargeback - coming in HDP-2.1https://issues.apache.org/jira/browse/YARN-415----- Meeting Notes (11/14/13 11:07) -----POINT AND CLICK TIME BASED RESOURCE ALLOCATION
Graph processing – Giraph, HamaStream proessing – Smaza, Storm, Spark, DataTorrentMapReduceTez – fast query executionWeave/REEF – frameworks to help with writing applicationsList of some of the applications which already support YARN, in some form.Smaza, Storm, S4 and DataTorrent are streaming frameworksVarious types of graph processing frameworks – Giraph and Hama are graph processing systemsThere’s some github projects – caching systems, on-demand web-server spin up Wave and REEF are frameworks on top of YARN to make writing applications easier
YarnClientApplication Lifecycle Control – submit, query state, kill - .e.g Submit, followed by poll state, and optionally connect to AM for further stateCluster Information – Query node status, Scheduler queues etcAMRMCLientRequest Resources from the RM – specifies a priority (within app only), location information, and whether strict locality is required, also number of such containers.Hides the details of the protocol – which is absolute (i.e. each call to the RM must provide complete information about a priority level)Once a container is allocated, provides a utility method to link it back to the entity which generated the request (entity optionally passed in while asking for containers)Allocate call doubles as a heartbeat – must be sent out ever so often.Also provides some node level information, available resources, etcNMClientOnce resources are obtained from the RM, an App uses this to communicate with the NM to start containers. Setup the ContainerLuanchContext – cmdLine, environment, resources.Authorization information is part of the contianers allocated by the RM – to prevent unauthorized usageOnly a single process can be launched on an allocated container, and one must be within a certain period otherwise the allocation times outContainer must launch a process with a certain time after being allocatedOnly a single process can be launched on an allocated container
HA and work preserving – being actively worked upoin by the communitiy.Scheduler – Additional resources – specifically disk / network. Gang schedulingRolling upgrades – upgrading a cluster typically involves downtime. NM forgets containers across restartsLong Running – Enhandcement to log handling, security, multiple tasks per container, container resizingHA and work preserving restart are still being worked on in the community – YARN-128 and YARN-149.On scheduling – there’ve been requests for gang scheduling, meeting SLAs. Also TBD is support for scheduling additional resource types – disk/ network.Rolling Upgrades – some work pending. Big piece here, which ties in with work preserving restart – restarting a NodeManager should not cause processes started by the previous NM to be killedLong Running Services support – handling logs, security – specifically token expiryAdditional utility libraries to help AppWriters – primarily geared towards checkpointing in the AM, app history handling
The first wave of Hadoop was about HDFS and MapReduce where MapReduce had a split brain, so to speak. It was a framework for massive distributed data processing, but it also had all of the Job Management capabilities built into it.The second wave of Hadoop is upon us and a component called YARN has emerged that generalizes Hadoop’s Cluster Resource Management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”.[CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways. With predictable performance and quality of service. [CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future.For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that’s been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.