Apache Hadoop YARN
Enabling next generation data applications
Agenda
• Why YARN?
• YARN Architecture and Concepts
• Building applications on YARN
• Next Steps
The 1st Generation of Hadoop: Batch
HADOOP 1.0: Built for Web-Scale Batch Apps

[Diagram: separate silos, each a single app (BATCH, INTERACTIVE, or ONLINE) running on its own HDFS cluster]

• All other usage patterns must leverage that same infrastructure
• Forces the creation of silos for managing mixed workloads
Hadoop MapReduce Classic
• JobTracker
– Manages cluster resources and job scheduling
• TaskTracker
– Per-node agent
– Manages tasks
MapReduce Classic: Limitations
• Scalability
– Maximum Cluster size – 4,000 nodes
– Maximum concurrent tasks – 40,000
– Coarse synchronization in JobTracker
• Availability
– Failure kills all queued and running jobs
• Hard partition of resources into map and reduce slots
– Low resource utilization
• Lacks support for alternate paradigms and services
– Iterative applications implemented using MapReduce are 10x slower
Our Vision: Hadoop as Next-Gen Platform
HADOOP 1.0 - Single Use System: Batch Apps
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)

HADOOP 2.0 - Multi Purpose Platform: Batch, Interactive, Online, Streaming, …
• HDFS2 (redundant, reliable storage)
• YARN (cluster resource management)
• MapReduce and others (data processing)
YARN: Taking Hadoop Beyond Batch
Applications Run Natively IN Hadoop

[Diagram: HDFS2 (redundant, reliable storage) and YARN (cluster resource management) underpinning BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), and OTHER (Search, Weave, …)]

Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS, with Predictable Performance and Quality of Service
5 Key Benefits of YARN
1. Scale
2. New Programming Models & Services
3. Improved cluster utilization
4. Agility
5. Beyond Java
A Brief History of YARN
• Originally conceived & architected by the team at Yahoo!
– Arun Murthy created the original JIRA in 2008 and led the PMC
• The team at Hortonworks has been working on YARN for 4 years: 90% of code from Hortonworks & Yahoo!
• YARN based architecture running at scale at Yahoo!
– Deployed on 35,000 nodes for 6+ months
• Multitude of YARN applications
Concepts
• Application
– Application is a job submitted to the framework
– Example – Map Reduce Job
• Container
– Basic unit of allocation
– Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, gpu etc.)
– container_0 = 2GB, 1CPU
– container_1 = 1GB, 6 CPU
– Replaces the fixed map/reduce slots
YARN Architecture

[Diagram: the ResourceManager (with its Scheduler) coordinating a cluster of NodeManagers; AM 1 runs Containers 1.1-1.3 and AM2 runs Containers 2.1-2.4 on the NodeManagers]
Architecture
• Resource Manager
– Global resource scheduler
– Hierarchical queues
• Node Manager
– Per-machine agent
– Manages the life-cycle of containers
– Container resource monitoring
• Application Master
– Per-application
– Manages application scheduling and task execution
– E.g. MapReduce Application Master
Design Centre
• Split up the two major functions of JobTracker
– Cluster resource management
– Application life-cycle management
• MapReduce becomes user-land library
YARN Architecture - Walkthrough

[Diagram: Client2 submits an application to the ResourceManager (Scheduler); AM 1 runs Containers 1.1-1.3 and AM2 runs Containers 2.1-2.4 across the NodeManagers]
Review - 5 Key Benefits of YARN
1. Scale
2. New Programming Models & Services
3. Improved cluster utilization
4. Agility
5. Beyond Java
YARN Applications
• Data processing applications and services
– Online Serving – HOYA (HBase on YARN)
– Real-time event processing – Storm, S4, other commercial platforms
– Tez – Generic framework to run a complex DAG
– MPI: OpenMPI, MPICH2
– Master-Worker
– Machine Learning: Spark
– Graph processing: Giraph
– Enabled by allowing the use of paradigm-specific application masters
Run all on the same Hadoop cluster!
YARN – Implementing Applications
• What APIs do I need to use?
– Only three protocols:
– Client to ResourceManager: application submission
– ApplicationMaster to ResourceManager: container allocation
– ApplicationMaster to NodeManager: container launch
– Use the client libraries (module yarn-client) for all 3 actions
– Provides both synchronous and asynchronous libraries
– Or use a 3rd party library like Weave: http://continuuity.github.io/weave/
YARN – Implementing Applications
• What do I need to do?
– Write a submission Client
– Write an ApplicationMaster (well, copy-paste)
– DistributedShell is the new WordCount!
– Get containers, run whatever you want!
YARN – Implementing Applications
• What else do I need to know?
– Resource Allocation & Usage
– ResourceRequest
– Container
– ContainerLaunchContext
– LocalResource
– ApplicationMaster
– ApplicationId
– ApplicationAttemptId
– ApplicationSubmissionContext
YARN – Resource Allocation & Usage

• ResourceRequest
– Fine-grained resource ask to the ResourceManager
– Ask for a specific amount of resources (memory, cpu etc.) on a specific machine or rack
– Use special value of * for resource name for any machine

ResourceRequest fields:
– priority
– resourceName
– capability
– numContainers
YARN – Resource Allocation & Usage

• ResourceRequest – an example:

priority | capability    | resourceName | numContainers
0        | <2gb, 1 core> | host01       | 1
0        | <2gb, 1 core> | rack0        | 1
0        | <2gb, 1 core> | *            | 1
1        | <4gb, 1 core> | *            | 1
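To make the shape of a request concrete, here is a minimal sketch of building the first row of the table above; it is a sketch under the assumption of the Hadoop 2.x records API (the same Records.newRecord pattern used in the code walkthrough later in this deck), not a definitive recipe:

// Sketch: ask for one container of <2gb, 1 core> on host01 at priority 0
ResourceRequest req = Records.newRecord(ResourceRequest.class);
Priority priority = Records.newRecord(Priority.class);
priority.setPriority(0);
req.setPriority(priority);
req.setResourceName("host01");   // or a rack name, or "*" for any machine
Resource capability = Records.newRecord(Resource.class);
capability.setMemory(2048);      // in MB
capability.setVirtualCores(1);
req.setCapability(capability);
req.setNumContainers(1);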
YARN – Resource Allocation & Usage

• Container
– The basic unit of allocation in YARN
– The result of the ResourceRequest, provided by the ResourceManager to the ApplicationMaster
– A specific amount of resources (cpu, memory etc.) on a specific machine

Container fields:
– containerId
– resourceName
– capability
– tokens
YARN – Resource Allocation & Usage

• ContainerLaunchContext
– The context provided by the ApplicationMaster to the NodeManager to launch the Container
– Complete specification for a process
– LocalResource used to specify container binary and dependencies
– NodeManager responsible for downloading from shared namespace (typically HDFS)

ContainerLaunchContext fields:
– container
– commands
– environment
– localResources

LocalResource fields:
– uri
– type
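To make the LocalResource piece concrete, here is a hedged sketch of localizing an application jar from HDFS into a container. The HDFS path, the "app.jar" key, and the MyWorker main class are invented for the example; the calls are from the Hadoop 2.x org.apache.hadoop.yarn.api.records and org.apache.hadoop.yarn.util packages:

// Sketch: wire a jar that is already in HDFS into a container's localResources
Path jarPath = new Path("/apps/my-app/app.jar");     // hypothetical HDFS path
FileStatus jarStatus = FileSystem.get(conf).getFileStatus(jarPath);

LocalResource jar = Records.newRecord(LocalResource.class);
jar.setResource(ConverterUtils.getYarnUrlFromPath(jarPath));
jar.setSize(jarStatus.getLen());                     // NodeManager verifies size...
jar.setTimestamp(jarStatus.getModificationTime());   // ...and timestamp on download
jar.setType(LocalResourceType.FILE);
jar.setVisibility(LocalResourceVisibility.APPLICATION);

ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
ctx.setLocalResources(Collections.singletonMap("app.jar", jar));
ctx.setCommands(Collections.singletonList("java -cp app.jar MyWorker"));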
YARN - ApplicationMaster

• ApplicationMaster
– Per-application controller, aka container_0
– Parent for all containers of the application
– ApplicationMaster negotiates all of its containers from the ResourceManager
– ApplicationMaster container is child of ResourceManager – think the init process in Unix
– RM restarts the ApplicationMaster attempt if required (unique ApplicationAttemptId)
– Code for the application is submitted along with the Application itself
YARN - ApplicationMaster

• ApplicationMaster
– ApplicationSubmissionContext is the complete specification of the ApplicationMaster, provided by the Client
– ResourceManager responsible for allocating and launching the ApplicationMaster container

ApplicationSubmissionContext fields:
– resourceRequest
– containerLaunchContext
– appName
– queue
YARN Application API - Overview

• hadoop-yarn-client module
• YarnClient is the submission client API
• Both synchronous & asynchronous APIs for resource allocation and container start/stop
• Synchronous API
– AMRMClient
– AMNMClient
• Asynchronous API
– AMRMClientAsync
– AMNMClientAsync
YARN Application API – The Client

[Diagram: Client2 talks to the ResourceManager (Scheduler): step 1 - New Application Request via YarnClient.createApplication; step 2 - Submit Application via YarnClient.submitApplication]
YARN Application API – The Client

• YarnClient
– createApplication to create an application
– submitApplication to start an application
– Application developer needs to provide ApplicationSubmissionContext
– APIs to get other information from the ResourceManager: getAllQueues, getApplications, getNodeReports
– APIs to manipulate the submitted application, e.g. killApplication
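A small sketch of the query side, assuming a cluster reachable through the default Configuration; the calls are the YarnClient methods listed above:

YarnClient yarnClient = YarnClient.createYarnClient();
yarnClient.init(new Configuration());
yarnClient.start();

// Ask the ResourceManager about its applications
for (ApplicationReport report : yarnClient.getApplications()) {
  System.out.println(report.getApplicationId() + " " + report.getYarnApplicationState());
  // A runaway application could be stopped with:
  // yarnClient.killApplication(report.getApplicationId());
}
yarnClient.stop();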
YARN Application API – Resource Allocation

[Diagram: the ApplicationMaster's conversation with the ResourceManager (Scheduler): 1 registerApplicationMaster, 2 AMRMClient.allocate, 3 Container(s) returned, 4 unregisterApplicationMaster]
YARN Application API – Resource Allocation

• AMRMClient - Synchronous API for the ApplicationMaster to interact with the ResourceManager
– Prologue / epilogue – registerApplicationMaster / unregisterApplicationMaster
– Resource negotiation with the ResourceManager
• Internal book-keeping - addContainerRequest / removeContainerRequest / releaseAssignedContainer
• Main API – allocate
– Helper APIs for cluster information: getAvailableResources, getClusterNodeCount
YARN Application API – Resource Allocation

• AMRMClientAsync - Asynchronous API for the ApplicationMaster
– Extension of AMRMClient that drives an asynchronous CallbackHandler
– Callbacks make it easier for the application developer to build a mental model of the interaction with the ResourceManager: onContainersAllocated, onContainersCompleted, onNodesUpdated, onError, onShutdownRequest
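A hedged sketch of the asynchronous style. The handler bodies here are placeholders; the point is that a heartbeat interval and the CallbackHandler are supplied when the client is created, and the callbacks then fire on the client's heartbeat thread:

AMRMClientAsync.CallbackHandler handler = new AMRMClientAsync.CallbackHandler() {
  public void onContainersAllocated(List<Container> containers) {
    // launch work on each container, e.g. via the NodeManager client
  }
  public void onContainersCompleted(List<ContainerStatus> statuses) {
    // inspect exit codes; re-request failed containers if desired
  }
  public void onNodesUpdated(List<NodeReport> updatedNodes) { }
  public void onShutdownRequest() { /* the RM asked us to shut down */ }
  public void onError(Throwable e) { /* unrecoverable - stop the client */ }
  public float getProgress() { return 0.5f; /* reported to the RM */ }
};

AMRMClientAsync<ContainerRequest> rmClient =
    AMRMClientAsync.createAMRMClientAsync(1000, handler); // heartbeat every 1s
rmClient.init(new Configuration());
rmClient.start();
rmClient.registerApplicationMaster("", 0, "");
// ... rmClient.addContainerRequest(...) as in the synchronous case ...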
YARN Application API – Using Resources

[Diagram: AM 1 launches and monitors Container 1.1 on a NodeManager via AMNMClient.startContainer and AMNMClient.getContainerStatus]
YARN Application API – Using Resources

• AMNMClient - Synchronous API for the ApplicationMaster to launch / stop containers at the NodeManager
– Simple (trivial) APIs: startContainer, stopContainer, getContainerStatus
YARN Application API – Using Resources

• AMNMClientAsync - Asynchronous API for the ApplicationMaster to launch / stop containers at the NodeManager
– Simple (trivial) APIs: startContainerAsync, stopContainerAsync, getContainerStatusAsync
– CallbackHandler makes it easier for the application developer to build a mental model of the interaction with the NodeManager: onContainerStarted, onContainerStopped, onStartContainerError, onContainerStatusReceived
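A matching sketch for the NodeManager side. Note that in the hadoop-yarn-client module the concrete classes are NMClient / NMClientAsync; the handler below is illustrative and leaves most callbacks empty:

NMClientAsync nmClient = NMClientAsync.createNMClientAsync(
    new NMClientAsync.CallbackHandler() {
      public void onContainerStarted(ContainerId id, Map<String, ByteBuffer> resp) { }
      public void onContainerStatusReceived(ContainerId id, ContainerStatus status) { }
      public void onContainerStopped(ContainerId id) { }
      public void onStartContainerError(ContainerId id, Throwable t) { }
      public void onGetContainerStatusError(ContainerId id, Throwable t) { }
      public void onStopContainerError(ContainerId id, Throwable t) { }
    });
nmClient.init(new Configuration());
nmClient.start();
// nmClient.startContainerAsync(container, ctx); // onContainerStarted fires on success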
YARN Application API - Development

• Un-Managed Mode for the ApplicationMaster
– Run the ApplicationMaster on a development machine rather than in-cluster
– No submission client needed
– hadoop-yarn-applications-unmanaged-am-launcher
– Easier to step through a debugger, browse logs etc.

$ bin/hadoop jar hadoop-yarn-applications-unmanaged-am-launcher.jar \
    Client \
    -jar my-application-master.jar \
    -cmd 'java MyApplicationMaster <args>'
Sample YARN Application – DistributedShell

• Overview
– YARN application to run n copies of a Shell command
– Simplest example of a YARN application – get n containers and run a specific Unix command

$ bin/hadoop jar hadoop-yarn-applications-distributedshell.jar \
    org.apache.hadoop.yarn.applications.distributedshell.Client \
    -shell_command '/bin/date' \
    -num_containers <n>

Code: https://github.com/hortonworks/simple-yarn-app
Sample YARN Application – DistributedShell

• Code Overview
– User submits the application to the ResourceManager via org.apache.hadoop.yarn.applications.distributedshell.Client
– Client provides the ApplicationSubmissionContext to the ResourceManager
– It is the responsibility of org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster to negotiate n containers
– ApplicationMaster launches containers with the user-specified command as ContainerLaunchContext.commands
Sample YARN Application – DistributedShell

• Client – Code Walkthrough
– hadoop-yarn-client module
– Steps:
– YarnClient.createApplication
– Specify the ApplicationSubmissionContext – in particular the ContainerLaunchContext with its commands – and other key pieces such as the resource capability for the ApplicationMaster container, plus queue, appName, appType etc.
– YarnClient.submitApplication
// Create yarnClient
YarnClient yarnClient = YarnClient.createYarnClient();
yarnClient.init(new Configuration());
yarnClient.start();

// Create application via yarnClient
YarnClientApplication app = yarnClient.createApplication();
Sample YARN Application – DistributedShell

• Client – Code Walkthrough

// Set up the container launch context for the ApplicationMaster
ContainerLaunchContext amContainer =
    Records.newRecord(ContainerLaunchContext.class);
List<String> commands = new ArrayList<String>();
commands.add("$JAVA_HOME/bin/java");
commands.add("-Xmx256M");
commands.add(
    "org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster");
commands.add("--container_memory 1024");
commands.add("--container_cores 1");
commands.add("--num_containers 3");
amContainer.setCommands(commands);

// Set up resource type requirements for the ApplicationMaster
Resource capability = Records.newRecord(Resource.class);
capability.setMemory(256);
capability.setVirtualCores(2);

The commands list is the command line used to launch the ApplicationMaster process; capability describes the resources required for the ApplicationMaster container.
Sample YARN Application – DistributedShell

• Client – Code Walkthrough

// Finally, set up the ApplicationSubmissionContext for the application
ApplicationSubmissionContext appContext =
    app.getApplicationSubmissionContext();
appContext.setQueue("my-queue");                     // queue
appContext.setAMContainerSpec(amContainer);
appContext.setResource(capability);
appContext.setApplicationName("my-app");             // application name
appContext.setApplicationType("DISTRIBUTED_SHELL");  // application type

// Submit application
yarnClient.submitApplication(appContext);

The ApplicationSubmissionContext fully describes the ApplicationMaster; submitApplication hands it to the ResourceManager.
Sample YARN Application – DistributedShell

• ApplicationMaster – Code Walkthrough
– Again, hadoop-yarn-client module
– Steps:
– AMRMClient.registerApplicationMaster
– Negotiate containers from the ResourceManager by providing a ContainerRequest to AMRMClient.addContainerRequest
– Take the resultant Container returned via a subsequent call to AMRMClient.allocate, build a ContainerLaunchContext with the Container and commands, then launch them using AMNMClient.startContainer
– Use LocalResources to specify software/configuration dependencies for each worker container
– Wait till done… AllocateResponse.getCompletedContainersStatuses from subsequent calls to AMRMClient.allocate
– AMRMClient.unregisterApplicationMaster
Sample YARN Application – DistributedShell

• ApplicationMaster – Code Walkthrough

// Initialize clients to the ResourceManager and NodeManagers
Configuration conf = new Configuration();

AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
rmClient.init(conf);
rmClient.start();

NMClient nmClient = NMClient.createNMClient();
nmClient.init(conf);
nmClient.start();

// Register with the ResourceManager
rmClient.registerApplicationMaster("", 0, "");
Sample YARN Application – DistributedShell

• ApplicationMaster – Code Walkthrough

// Priority for worker containers - priorities are intra-application
Priority priority = Records.newRecord(Priority.class);
priority.setPriority(0);

// Resource requirements for worker containers
Resource capability = Records.newRecord(Resource.class);
capability.setMemory(128);
capability.setVirtualCores(1);

// Make container requests to the ResourceManager
for (int i = 0; i < n; ++i) {
  ContainerRequest containerAsk =
      new ContainerRequest(capability, null, null, priority);
  rmClient.addContainerRequest(containerAsk);
}
Sample YARN Application – DistributedShell

• ApplicationMaster – Code Walkthrough

// Obtain allocated containers and launch them
int allocatedContainers = 0;
while (allocatedContainers < n) {
  AllocateResponse response = rmClient.allocate(0);
  for (Container container : response.getAllocatedContainers()) {
    ++allocatedContainers;

    // Launch the container by creating a ContainerLaunchContext
    ContainerLaunchContext ctx =
        Records.newRecord(ContainerLaunchContext.class);
    ctx.setCommands(Collections.singletonList("/bin/date"));
    nmClient.startContainer(container, ctx);
  }
  Thread.sleep(100);
}
Sample YARN Application – DistributedShell

• ApplicationMaster – Code Walkthrough

// Now wait for the containers to complete
int completedContainers = 0;
while (completedContainers < n) {
  AllocateResponse response =
      rmClient.allocate((float) completedContainers / n);
  for (ContainerStatus status : response.getCompletedContainersStatuses()) {
    if (status.getExitStatus() == 0) {
      ++completedContainers;
    }
  }
  Thread.sleep(100);
}

// Un-register with the ResourceManager
rmClient.unregisterApplicationMaster(
    FinalApplicationStatus.SUCCEEDED, "", "");
Apache Hadoop MapReduce on YARN

• Original use-case
• Most complex application to build
– Data-locality
– Fault tolerance
– ApplicationMaster recovery: checkpoint to HDFS
– Intra-application priorities: Maps vs. Reduces
– Needed complex feedback mechanism from the ResourceManager
– Security
– Isolation
• Binary compatible with Apache Hadoop 1.x
Apache Hadoop MapReduce on YARN

[Diagram: two jobs sharing the cluster - MR AM 1 with map 1.1, map 1.2, reduce 1.1 and MR AM2 with map 2.1, map 2.2, reduce 2.1, reduce 2.2 spread across the NodeManagers under the ResourceManager (Scheduler)]
Apache Tez on YARN

• Replaces MapReduce as the primitive for Pig, Hive, Cascading etc.
– Smaller latency for interactive queries
– Higher throughput for batch queries
– http://incubator.apache.org/projects/tez.html

[Diagram: the same query plan executed as Hive - MR vs. Hive - Tez]

SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id)
GROUP BY a
UNION SELECT x, AVERAGE(y) AS AVG
FROM c
GROUP BY x
ORDER BY AVG;
Apache Tez on YARN

[Diagram: MR AM 1 with map/reduce containers running alongside Tez AM1 with vertex 1.1.1, 1.1.2, 1.2.1, 1.2.2 containers under the ResourceManager (Scheduler)]
HOYA - Apache HBase on YARN

Use cases:
• Small HBase cluster in large YARN cluster
• Dynamic HBase clusters
• Transient/intermittent clusters for workflows

Hoya:
• HBase becomes a user-level application
• Creates, starts, stops, and deletes HBase clusters
• Flex cluster size: increase/decrease size with load
• Recover from Region Server loss with a new container

Code: https://github.com/hortonworks/hoya
HOYA - Apache HBase on YARN

[Diagram: HOYA AM1 managing HBase Master 1.1 and RegionServers 1.1-1.3 in containers, sharing the cluster with a MapReduce job under the ResourceManager (Scheduler)]
HOYA - Highlights

• Cluster specification stored as JSON in HDFS
• Config directory cached - dynamically patched before pushing up as local resources for Master & RegionServers
• HBase tar file stored in HDFS - clusters can use the same/different HBase versions
• Handling of cluster flexing is the same code as unplanned container loss
• No Hoya code on RegionServers: client and AM only
Storm on YARN

• Ability to deploy multiple Storm clusters on YARN for real-time event processing
• Yahoo – primary contributor
– 200+ nodes in production
• Ability to recover from faulty nodes
– Get new containers
• Auto-scale for load balancing
– Get new containers as load increases
– Release containers as load decreases

Code: https://github.com/yahoo/storm-yarn
Storm on YARN

[Diagram: Storm AM1 managing Nimbus 1.1-1.4 containers, sharing the cluster with a MapReduce job under the ResourceManager (Scheduler)]
General Architectural Considerations
• Fault Tolerance
– Checkpoint
• Security
• Always-On services
• Scheduler features
– Whitelist resources
– Blacklist resources
– Labels for machines
– License management
HDP 2.0 Community Preview & YARN Certification Program

Goal: accelerate the number of certified YARN-based solutions

• HDP 2.0 Community Preview
– Contains the latest community Beta of Apache Hadoop 2.0 & YARN
– Delivered as an easy to use Sandbox VM, as well as RPMs and Tarballs
– Enables the YARN Cert Program: community & commercial ecosystem to test and certify new and existing YARN-based apps

• YARN Certification Program
– More than 14 partners in the program at launch:
– Splunk*
– Elastic Search*
– Altiscale*
– Concurrent*
– Microsoft
– Platfora
– Tableau
– (IBM) DataStage
– Informatica
– Karmasphere
– and others

* Already certified
Technical Resources

• hadoop-2.1.0-beta:
– http://hadoop.apache.org/common/releases.html
• Release Documentation:
– http://hadoop.apache.org/common/docs/r2.1.0-beta/
• Blogs:
– http://hortonworks.com/blog/category/apache-hadoop/yarn/
– http://hortonworks.com/blog/introducing-apache-hadoop-yarn/
– http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
What Next?
• Download the Book
• Download the Community preview
• Join the Beta Program
• Follow us… @hortonworks
• THANK YOU
http://www.hortonworks.com/hadoop/yarn
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele

  • 13. © Hortonworks Inc. 2012 YARN Architecture [diagram: a central ResourceManager (with pluggable Scheduler) coordinates a cluster of NodeManagers; AM 1 runs with Containers 1.1–1.3 and AM2 with Containers 2.1–2.4, all hosted inside NodeManagers]
  • 14. © Hortonworks Inc. 2013 Architecture • Resource Manager – Global resource scheduler – Hierarchical queues • Node Manager – Per-machine agent – Manages the life-cycle of containers – Container resource monitoring • Application Master – Per-application – Manages application scheduling and task execution – E.g. the MapReduce ApplicationMaster 14
  • 15. © Hortonworks Inc. 2013 Design Centre • Split up the two major functions of the JobTracker – Cluster resource management – Application life-cycle management • MapReduce becomes a user-land library 15
  • 16. © Hortonworks Inc. 2012 YARN Architecture - Walkthrough [diagram: Client2 submits an application to the ResourceManager (Scheduler); AM2 is launched on a NodeManager and negotiates Containers 2.1–2.4, while application 1's AM 1 and Containers 1.1–1.3 run alongside]
  • 17. © Hortonworks Inc. 2013 Review – 5 Key Benefits of YARN 1. Scale 2. New Programming Models & Services 3. Improved cluster utilization 4. Agility 5. Beyond Java Page 17
  • 18. © Hortonworks Inc. 2013 Agenda • Why YARN • YARN Architecture and Concepts • Building applications on YARN • Next Steps Page 18
  • 19. © Hortonworks Inc. 2013 YARN Applications • Data processing applications and services – Online serving – HOYA (HBase on YARN) – Real-time event processing – Storm, S4, other commercial platforms – Tez – Generic framework to run complex DAGs – MPI: OpenMPI, MPICH2 – Master-Worker – Machine learning: Spark – Graph processing: Giraph – Enabled by allowing the use of a paradigm-specific ApplicationMaster Run all on the same Hadoop cluster! Page 19
  • 20. © Hortonworks Inc. 2013 YARN – Implementing Applications • What APIs do I need to use? – Only three protocols – Client to ResourceManager – Application submission – ApplicationMaster to ResourceManager – Container allocation – ApplicationMaster to NodeManager – Container launch – Use the client libraries for all 3 actions – Module yarn-client – Provides both synchronous and asynchronous libraries – Or use a 3rd-party framework like Weave – http://continuuity.github.io/weave/ 20
  • 21. © Hortonworks Inc. 2013 YARN – Implementing Applications • What do I need to do? – Write a submission Client – Write an ApplicationMaster (well, copy-paste one) – DistributedShell is the new WordCount! – Get containers, run whatever you want! 21
  • 22. © Hortonworks Inc. 2013 YARN – Implementing Applications • What else do I need to know? – Resource Allocation & Usage – ResourceRequest – Container – ContainerLaunchContext – LocalResource – ApplicationMaster – ApplicationId – ApplicationAttemptId – ApplicationSubmissionContext 22
  • 23. © Hortonworks Inc. 2013 YARN – Resource Allocation & Usage • ResourceRequest – Fine-grained resource ask to the ResourceManager – Ask for a specific amount of resources (memory, cpu etc.) on a specific machine or rack – Use the special resource name * to accept any machine Page 23 ResourceRequest = { priority, resourceName, capability, numContainers }
  • 24. © Hortonworks Inc. 2013 YARN – Resource Allocation & Usage • ResourceRequest – example: a 2 GB / 1-core ask at priority 0 for host01, its rack, or anywhere, plus a 4 GB / 1-core ask at priority 1 anywhere:

      priority | capability    | resourceName | numContainers
      ---------+---------------+--------------+--------------
      0        | <2gb, 1 core> | host01       | 1
      0        | <2gb, 1 core> | rack0        | 1
      0        | <2gb, 1 core> | *            | 1
      1        | <4gb, 1 core> | *            | 1

  Page 24
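  A minimal Java sketch of how an ApplicationMaster could express the priority-0 rows above with the yarn-client helper classes (host01 and rack0 are the illustrative names from the slide; the library expands one locality-aware request into the host-, rack- and *-level ResourceRequests shown in the table):

      import org.apache.hadoop.yarn.api.records.Priority;
      import org.apache.hadoop.yarn.api.records.Resource;
      import org.apache.hadoop.yarn.client.api.AMRMClient;
      import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
      import org.apache.hadoop.yarn.util.Records;

      public class ResourceRequestSketch {
        static void askForContainer(AMRMClient<ContainerRequest> rmClient) {
          Priority p0 = Records.newRecord(Priority.class);
          p0.setPriority(0);

          Resource capability = Records.newRecord(Resource.class);
          capability.setMemory(2048);     // 2 GB
          capability.setVirtualCores(1);  // 1 core

          // One request naming a preferred host and its rack; the client
          // library also registers the rack- and *-level asks so the
          // ResourceManager can relax locality if the host is busy.
          ContainerRequest ask = new ContainerRequest(
              capability,
              new String[] { "host01" },  // illustrative node name
              new String[] { "rack0" },   // illustrative rack name
              p0);
          rmClient.addContainerRequest(ask);
        }
      }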
  • 25. © Hortonworks Inc. 2013 YARN – Resource Allocation & Usage • Container – The basic unit of allocation in YARN – The result of a ResourceRequest, provided by the ResourceManager to the ApplicationMaster – A specific amount of resources (cpu, memory etc.) on a specific machine Page 25 Container = { containerId, resourceName, capability, tokens }
  • 26. © Hortonworks Inc. 2013 YARN – Resource Allocation & Usage • ContainerLaunchContext – The context provided by the ApplicationMaster to the NodeManager to launch the Container – Complete specification for a process – LocalResource used to specify the container binary and its dependencies – NodeManager responsible for downloading them from a shared namespace (typically HDFS) Page 26 ContainerLaunchContext = { container, commands, environment, localResources }; LocalResource = { uri, type }
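  To make the record concrete, a hedged sketch of building a ContainerLaunchContext that localizes one binary from HDFS before running it (the jar path, the "app.jar" resource name, and the MyWorker command are illustrative, not part of the deck):

      import java.util.Collections;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
      import org.apache.hadoop.yarn.api.records.LocalResource;
      import org.apache.hadoop.yarn.api.records.LocalResourceType;
      import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
      import org.apache.hadoop.yarn.util.ConverterUtils;
      import org.apache.hadoop.yarn.util.Records;

      public class LaunchContextSketch {
        static ContainerLaunchContext build(FileSystem fs, Path jarOnHdfs) throws Exception {
          FileStatus stat = fs.getFileStatus(jarOnHdfs);

          // LocalResource: tells the NodeManager what to download and verify
          LocalResource jar = Records.newRecord(LocalResource.class);
          jar.setResource(ConverterUtils.getYarnUrlFromPath(jarOnHdfs));
          jar.setSize(stat.getLen());
          jar.setTimestamp(stat.getModificationTime());
          jar.setType(LocalResourceType.FILE);
          jar.setVisibility(LocalResourceVisibility.APPLICATION);

          // Complete process specification: dependencies + command line
          ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
          ctx.setLocalResources(Collections.singletonMap("app.jar", jar));
          ctx.setCommands(Collections.singletonList(
              "$JAVA_HOME/bin/java -cp app.jar MyWorker 1>stdout 2>stderr"));
          return ctx;
        }
      }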
  • 27. © Hortonworks Inc. 2013 YARN - ApplicationMaster • ApplicationMaster – Per-application controller, aka container_0 – Parent for all containers of the application – The ApplicationMaster negotiates all of its containers from the ResourceManager – The ApplicationMaster container is a child of the ResourceManager – Think of the init process in Unix – The RM restarts the ApplicationMaster attempt if required (unique ApplicationAttemptId) – Code for the application is submitted along with the application itself Page 27
  • 28. © Hortonworks Inc. 2013 YARN - ApplicationMaster • ApplicationMaster – ApplicationSubmissionContext is the complete specification of the ApplicationMaster, provided by the Client – ResourceManager responsible for allocating and launching the ApplicationMaster container Page 28 ApplicationSubmissionContext = { resourceRequest, containerLaunchContext, appName, queue }
  • 29. © Hortonworks Inc. 2013 YARN Application API - Overview • hadoop-yarn-client module • YarnClient is the submission client API • Both synchronous & asynchronous APIs for resource allocation and container start/stop • Synchronous API – AMRMClient – AMNMClient • Asynchronous API – AMRMClientAsync – AMNMClientAsync Page 29
  • 30. © Hortonworks Inc. 2012 YARN Application API – The Client [diagram: Client2 interacts with the ResourceManager (Scheduler) in two steps – (1) New Application Request: YarnClient.createApplication, (2) Submit Application: YarnClient.submitApplication]
  • 31. © Hortonworks Inc. 2012 • YarnClient – createApplication to create an application – submitApplication to start an application • The application developer needs to provide an ApplicationSubmissionContext – APIs to get other information from the ResourceManager • getAllQueues • getApplications • getNodeReports – APIs to manipulate a submitted application, e.g. killApplication 31 YARN Application API – The Client
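  As a quick sketch of those informational APIs (method names as shipped in the hadoop-yarn-client module; the printout format is illustrative):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.yarn.api.records.ApplicationReport;
      import org.apache.hadoop.yarn.api.records.NodeReport;
      import org.apache.hadoop.yarn.api.records.QueueInfo;
      import org.apache.hadoop.yarn.client.api.YarnClient;

      public class ClusterInfoSketch {
        public static void main(String[] args) throws Exception {
          YarnClient yarnClient = YarnClient.createYarnClient();
          yarnClient.init(new Configuration());
          yarnClient.start();

          // Cluster-wide information available to any client
          for (QueueInfo queue : yarnClient.getAllQueues()) {
            System.out.println("queue: " + queue.getQueueName());
          }
          for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
          }
          for (NodeReport node : yarnClient.getNodeReports()) {
            System.out.println(node.getNodeId() + " " + node.getCapability());
          }
          yarnClient.stop();
        }
      }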
  • 32. © Hortonworks Inc. 2012 YARN Application API – Resource Allocation [diagram: the ApplicationMaster (1) calls registerApplicationMaster, (2) calls AMRMClient.allocate, (3) receives Containers on NodeManagers, and (4) calls unregisterApplicationMaster when done; all negotiation goes through the ResourceManager (Scheduler)]
  • 33. © Hortonworks Inc. 2012 • AMRMClient - Synchronous API for the ApplicationMaster to interact with the ResourceManager – Prologue / epilogue – registerApplicationMaster / unregisterApplicationMaster – Resource negotiation with the ResourceManager • Internal book-keeping - addContainerRequest / removeContainerRequest / releaseAssignedContainer • Main API – allocate – Helper APIs for cluster information • getAvailableResources • getClusterNodeCount 33 YARN Application API – Resource Allocation
  • 34. © Hortonworks Inc. 2012 • AMRMClientAsync - Asynchronous API for the ApplicationMaster – Extension of AMRMClient that drives a CallbackHandler – Callbacks make it easier for the application developer to build a mental model of the interaction with the ResourceManager • onContainersAllocated • onContainersCompleted • onNodesUpdated • onError • onShutdownRequest 34 YARN Application API – Resource Allocation
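  A skeletal wiring of that handler, assuming the hadoop-yarn-client async API (the 1000 ms heartbeat interval and the empty callback bodies are placeholders):

      import java.util.List;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.yarn.api.records.Container;
      import org.apache.hadoop.yarn.api.records.ContainerStatus;
      import org.apache.hadoop.yarn.api.records.NodeReport;
      import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
      import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

      public class AsyncAmSketch implements AMRMClientAsync.CallbackHandler {
        public void onContainersAllocated(List<Container> containers) {
          // start work on each granted container (via the NodeManager API)
        }
        public void onContainersCompleted(List<ContainerStatus> statuses) {
          // check exit codes; re-request containers for failed work if needed
        }
        public void onNodesUpdated(List<NodeReport> updatedNodes) { }
        public void onShutdownRequest() { /* RM asked this attempt to stop */ }
        public void onError(Throwable e) { /* tear down and fail the attempt */ }
        public float getProgress() { return 0.0f; }  // reported to the RM

        public static void main(String[] args) throws Exception {
          AMRMClientAsync<ContainerRequest> rmClient =
              AMRMClientAsync.createAMRMClientAsync(1000, new AsyncAmSketch());
          rmClient.init(new Configuration());
          rmClient.start();
          rmClient.registerApplicationMaster("", 0, "");
          // ... rmClient.addContainerRequest(...); callbacks drive the rest ...
        }
      }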
  • 35. © Hortonworks Inc. 2012 YARN Application API – Using Resources [diagram: AM 1 calls AMNMClient.startContainer to launch Container 1.1 on a NodeManager and AMNMClient.getContainerStatus to monitor it; the ResourceManager (Scheduler) is not on this path]
  • 36. © Hortonworks Inc. 2012 • AMNMClient - Synchronous API for the ApplicationMaster to launch / stop containers at the NodeManager – Simple (trivial) APIs • startContainer • stopContainer • getContainerStatus 36 YARN Application API – Using Resources
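  In the shipped hadoop-yarn-client module this synchronous class is NMClient (the deck's own walkthrough code uses NMClient.createNMClient()); a minimal sketch of the three calls, assuming an already-allocated container and a prepared launch context:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.yarn.api.records.Container;
      import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
      import org.apache.hadoop.yarn.api.records.ContainerStatus;
      import org.apache.hadoop.yarn.client.api.NMClient;

      public class NmClientSketch {
        // Launch a context on an allocated container, poll its status, stop it
        static void run(Container container, ContainerLaunchContext ctx) throws Exception {
          NMClient nmClient = NMClient.createNMClient();
          nmClient.init(new Configuration());
          nmClient.start();

          nmClient.startContainer(container, ctx);
          ContainerStatus status =
              nmClient.getContainerStatus(container.getId(), container.getNodeId());
          System.out.println("state: " + status.getState());
          nmClient.stopContainer(container.getId(), container.getNodeId());
        }
      }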
  • 37. © Hortonworks Inc. 2012 • AMNMClientAsync - Asynchronous API for the ApplicationMaster to launch / stop containers at the NodeManager – Simple (trivial) APIs • startContainerAsync • stopContainerAsync • getContainerStatusAsync – CallbackHandler makes it easier for the application developer to build a mental model of the interaction with the NodeManager • onContainerStarted • onContainerStopped • onStartContainerError • onContainerStatusReceived 37 YARN Application API – Using Resources
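  The corresponding interface in the shipped module is NMClientAsync.CallbackHandler; a sketch with the full method set (the slide lists a subset; the bodies here are placeholders):

      import java.nio.ByteBuffer;
      import java.util.Map;
      import org.apache.hadoop.yarn.api.records.ContainerId;
      import org.apache.hadoop.yarn.api.records.ContainerStatus;
      import org.apache.hadoop.yarn.client.api.async.NMClientAsync;

      public class NmCallbackSketch implements NMClientAsync.CallbackHandler {
        public void onContainerStarted(ContainerId id, Map<String, ByteBuffer> allServiceResponse) {
          // the container process is now running on its NodeManager
        }
        public void onContainerStatusReceived(ContainerId id, ContainerStatus status) { }
        public void onContainerStopped(ContainerId id) { }
        public void onStartContainerError(ContainerId id, Throwable t) {
          // e.g. release the container and ask the RM for a replacement
        }
        public void onGetContainerStatusError(ContainerId id, Throwable t) { }
        public void onStopContainerError(ContainerId id, Throwable t) { }
      }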
  • 38. © Hortonworks Inc. 2013 YARN Application API - Development • Un-Managed Mode for the ApplicationMaster – Run the ApplicationMaster on a development machine rather than in-cluster – No submission client – hadoop-yarn-applications-unmanaged-am-launcher – Easier to step through with a debugger, browse logs etc. Page 38

      $ bin/hadoop jar hadoop-yarn-applications-unmanaged-am-launcher.jar \
          Client \
          -jar my-application-master.jar \
          -cmd 'java MyApplicationMaster <args>'
  • 39. © Hortonworks Inc. 2013 Sample YARN Application – DistributedShell • Overview – YARN application to run n copies of a shell command – Simplest example of a YARN application – get n containers and run a specific Unix command Page 39

      $ bin/hadoop jar hadoop-yarn-applications-distributedshell.jar \
          org.apache.hadoop.yarn.applications.distributedshell.Client \
          -shell_command '/bin/date' \
          -num_containers <n>

  Code: https://github.com/hortonworks/simple-yarn-app
  • 40. © Hortonworks Inc. 2013 Sample YARN Application – DistributedShell • Code Overview – The user submits the application to the ResourceManager via org.apache.hadoop.yarn.applications.distributedshell.Client – The Client provides an ApplicationSubmissionContext to the ResourceManager – It is the responsibility of org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster to negotiate the n containers – The ApplicationMaster launches the containers with the user-specified command as ContainerLaunchContext.commands Page 40
  • 41. © Hortonworks Inc. 2013 Sample YARN Application – DistributedShell • Client – Code Walkthrough – hadoop-yarn-client module – Steps: – YarnClient.createApplication – Specify the ApplicationSubmissionContext, in particular the ContainerLaunchContext with commands, and other key pieces such as the resource capability for the ApplicationMaster container and queue, appName, appType etc. – YarnClient.submitApplication Page 41

      // Create yarnClient
      YarnClient yarnClient = YarnClient.createYarnClient();
      yarnClient.init(new Configuration());
      yarnClient.start();

      // Create application via yarnClient
      YarnClientApplication app = yarnClient.createApplication();
  • 42. © Hortonworks Inc. 2013 Sample YARN Application – DistributedShell Page 42 9.  // Set up the container launch context for the application master! 1 0.    ContainerLaunchContext amContainer = ! 1 1.        Records.newRecord(ContainerLaunchContext.class);! 1 2.    List<String> command = new List<String>();! 1 3.    commands.add("$JAVA_HOME/bin/java");! 1 4.    commands.add("-Xmx256M");! 1 5.    commands.add(! 1 6.        "org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster");! 1 7.    commands.add("--container_memory 1024” );! 1 8.    commands.add("--container_cores 1”);! 1 9.    commands.add("--num_containers 3");;! 2 0.    amContainer.setCommands(commands);! 2 1. ! 2 2.    // Set up resource type requirements for ApplicationMaster! 2 3.    Resource capability = Records.newRecord(Resource.class);! 2 4.    capability.setMemory(256);! 2 5.    capability.setVirtualCores(2);! 2 6. ! • Client – Code Walkthrough Command to launch ApplicationMaster process Resources required for ApplicationMaster container!
  • 43. © Hortonworks Inc. 2013 Sample YARN Application – DistributedShell Page 43 2 7.    // Finally, set-up ApplicationSubmissionContext for the application! 2 8.    ApplicationSubmissionContext appContext = ! 2 9.        app.getApplicationSubmissionContext();! 3 0.    appContext.setQueue("my-queue");            // queue ! 3 1.    appContext.setAMContainerSpec(amContainer);! 3 2.    appContext.setResource(capability);! 3 3.    appContext.setApplicationName("my-app");     // application name! 3 4.    appContext.setApplicationType(”DISTRIBUTED_SHELL");    // application type! 3 5. ! 3 6.    // Submit application! 3 7.    yarnClient.submitApplication(appContext); • Client – Code Walkthrough ApplicationSubmissionContext ! for ApplicationMaster Submit application to ResourceManager
  • 44. © Hortonworks Inc. 2013 Sample YARN Application – DistributedShell • ApplicationMaster – Code Walkthrough – Again, hadoop-yarn-client module – Steps: – AMRMClient.registerApplicationMaster – Negotiate containers from the ResourceManager by providing a ContainerRequest to AMRMClient.addContainerRequest – Take the resultant Container returned via a subsequent call to AMRMClient.allocate, build a ContainerLaunchContext with the Container and commands, then launch it using NMClient.startContainer – Use LocalResources to specify software/configuration dependencies for each worker container – Wait till done… AllocateResponse.getCompletedContainersStatuses from subsequent calls to AMRMClient.allocate – AMRMClient.unregisterApplicationMaster Page 44
  • 45. © Hortonworks Inc. 2013 Sample YARN Application – DistributedShell • ApplicationMaster – Code Walkthrough Page 45

      // Initialize clients to the ResourceManager and NodeManagers
      Configuration conf = new Configuration();

      AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
      rmClient.init(conf);
      rmClient.start();

      NMClient nmClient = NMClient.createNMClient();
      nmClient.init(conf);
      nmClient.start();

      // Register with the ResourceManager
      rmClient.registerApplicationMaster("", 0, "");

  (callouts: initialize clients to the ResourceManager and NodeManagers; register with the ResourceManager)
  • 46. © Hortonworks Inc. 2013 Sample YARN Application – DistributedShell • ApplicationMaster – Code Walkthrough Page 46 1 5.    // Priority for worker containers - priorities are intra-application! 1 6.    Priority priority = Records.newRecord(Priority.class);! 1 7.    priority.setPriority(0);      ! 1 8. ! 1 9.    // Resource requirements for worker containers! 2 0.    Resource capability = Records.newRecord(Resource.class);! 2 1.    capability.setMemory(128);! 2 2.    capability.setVirtualCores(1);! 2 3. ! 2 4.    // Make container requests to ResourceManager! 2 5.    for (int i = 0; i < n; ++i) {! 2 6.      ContainerRequest containerAsk = new ContainerRequest(capability, null, null, priority);! 2 7.      rmClient.addContainerRequest(containerAsk);! 2 8.    } ! Setup requirements for worker containers Make resource requests to ResourceManager
  • 47. © Hortonworks Inc. 2013 Sample YARN Application – DistributedShell • ApplicationMaster – Code Walkthrough Page 47 ! 3 0.    // Obtain allocated containers and launch ! 3 1.    int allocatedContainers = 0;! 3 2.    while (allocatedContainers < n) {! 3 3.      AllocateResponse response = rmClient.allocate(0);! 3 4.      for (Container container : response.getAllocatedContainers()) {! 3 5.        ++allocatedContainers;! 3 6. ! 3 7.        // Launch container by create ContainerLaunchContext! 3 8.        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);! 3 9.        ctx.setCommands(Collections.singletonList(”/bin/date”));! 4 0.        nmClient.startContainer(container, ctx);! 4 1.      }! 4 2.      Thread.sleep(100);! 4 3.    } ! Setup requirements for worker containers Make resource requests to ResourceManager
  • 48. © Hortonworks Inc. 2013 Sample YARN Application – DistributedShell • ApplicationMaster – Code Walkthrough Page 48 4     // Now wait for containers to complete! 4     int completedContainers = 0;! 4     while (completedContainers < n) {! 4       AllocateResponse response = rmClient.allocate(completedContainers/n);! 4       for (ContainerStatus status : response.getCompletedContainersStatuses()) {! 5         if (status.getExitStatus() == 0) {! 5           ++completedContainers;! 5         }  ! 5       }! 4       Thread.sleep(100);! 5     }! 5  ! 5     // Un-register with ResourceManager! 5     rmClient.unregisterApplicationMaster(SUCCEEDED, "", "");! Wait for containers to complete successfully Un-register with ResourceManager
  • 49. © Hortonworks Inc. 2013 Apache Hadoop MapReduce on YARN • The original use-case • The most complex application to build – Data-locality – Fault tolerance – ApplicationMaster recovery: checkpoint to HDFS – Intra-application priorities: maps vs. reduces – Needed a complex feedback mechanism from the ResourceManager – Security – Isolation • Binary compatible with Apache Hadoop 1.x Page 49
  • 50. © Hortonworks Inc. 2012 Apache Hadoop MapReduce on YARN [diagram: two MapReduce jobs sharing one cluster – MR AM 1 with map 1.1, map 1.2 and reduce 1.1, and MR AM2 with map 2.1, map 2.2, reduce 2.1 and reduce 2.2, all in containers on NodeManagers under one ResourceManager (Scheduler)]
  • 51. © Hortonworks Inc. 2013 Apache Tez on YARN • Replaces MapReduce as the primitive for Pig, Hive, Cascading etc. – Lower latency for interactive queries – Higher throughput for batch queries – http://incubator.apache.org/projects/tez.html Page 51 [diagram: the same query planned as a single Tez DAG (Hive - Tez) vs. a chain of MapReduce jobs (Hive - MR)]

      SELECT a.x, AVERAGE(b.y) AS avg
      FROM a JOIN b ON (a.id = b.id)
      GROUP BY a
      UNION
      SELECT x, AVERAGE(y) AS AVG FROM c
      GROUP BY x
      ORDER BY AVG;
  • 52. © Hortonworks Inc. 2012 Apache Tez on YARN [diagram: a MapReduce job (MR AM 1 with map 1.1, map 1.2, reduce 1.1) and a Tez job (Tez AM1 with vertex containers 1.1.1, 1.1.2, 1.2.1, 1.2.2) running side by side on the same ResourceManager (Scheduler) and NodeManagers]
  • 53. © Hortonworks Inc. 2013 HOYA - Apache HBase on YARN Use cases • Small HBase clusters in a large YARN cluster • Dynamic HBase clusters • Transient/intermittent clusters for workflows Hoya: • HBase becomes a user-level application • Creates, starts, stops and deletes HBase clusters • Flexes cluster size: increase/decrease with load • Recovers from RegionServer loss with a new container Page 53 Code: https://github.com/hortonworks/hoya
  • 54. © Hortonworks Inc. 2012 HOYA - Apache HBase on YARN [diagram: a MapReduce job (MR AM 1 with map 1.1, map 1.2, reduce 1.1) runs alongside an HBase cluster managed by HOYA AM1 – HBase Master 1.1 and RegionServers 1.1–1.3 in containers on the same NodeManagers under one ResourceManager (Scheduler)]
  • 55. © Hortonworks Inc. 2013 HOYA - Highlights • Cluster specification stored as JSON in HDFS • Config directory cached - dynamically patched before being pushed up as local resources for the Master & RegionServers • HBase tar file stored in HDFS - clusters can use the same or different HBase versions • Cluster flexing uses the same code path as recovery from unplanned container loss • No Hoya code on the RegionServers: client and AM only Page 55
  • 56. © Hortonworks Inc. 2013 Storm on YARN • Ability to deploy multiple Storm clusters on YARN for real-time event processing • Yahoo – Primary contributor – 200+ nodes in production • Ability to recover from faulty nodes – Get new containers • Auto-scale for load balancing – Get new containers as load increases – Release containers as load decreases Page 56 Code: https://github.com/yahoo/storm-yarn
  • 57. © Hortonworks Inc. 2012 Storm on YARN [diagram: a MapReduce job (MR AM 1 with map 1.1, map 1.2, reduce 1.1) runs alongside a Storm cluster managed by Storm AM1 with containers Nimbus 1.1–1.4, all on the same ResourceManager (Scheduler) and NodeManagers]
  • 58. © Hortonworks Inc. 2013 General Architectural Considerations • Fault Tolerance – Checkpoint • Security • Always-On services • Scheduler features – Whitelist resources – Blacklist resources – Labels for machines – License management Page 58
  • 59. © Hortonworks Inc. 2013 Agenda • Why YARN • YARN Architecture and Concepts • Building applications on YARN • Next Steps Page 59
  • 60. © Hortonworks Inc. 2013 HDP 2.0 Community Preview & YARN Certification Program Goal: Accelerate the number of certified YARN-based solutions • HDP 2.0 Community Preview – Contains the latest community Beta of Apache Hadoop 2.0 & YARN – Delivered as an easy-to-use Sandbox VM, as well as RPMs and Tarballs – Enables the YARN Certification Program: the community & commercial ecosystem can test and certify new and existing YARN-based apps • YARN Certification Program – More than 14 partners in the program at launch – Splunk* – Elastic Search* – Altiscale* – Concurrent* – Microsoft – Platfora – Tableau – (IBM) DataStage – Informatica – Karmasphere – and others (* Already certified)
  • 61. © Hortonworks Inc. 2013 Technical Resources • hadoop-2.1.0-beta: – http://hadoop.apache.org/common/releases.html • Release Documentation: – http://hadoop.apache.org/common/docs/r2.1.0-beta/ • Blogs: – http://hortonworks.com/blog/category/apache-hadoop/yarn/ – http://hortonworks.com/blog/introducing-apache-hadoop-yarn/ – http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/ Page 61
  • 62. © Hortonworks Inc. 2013 What Next? • Download the Book • Download the Community preview • Join the Beta Program • Follow us… @hortonworks • THANK YOU Page 62 http://www.hortonworks.com/hadoop/yarn