As part of the recent release of Hadoop 2 by the Apache Software Foundation, YARN and MapReduce 2 deliver significant upgrades to scheduling, resource management, and execution in Hadoop.
At their core, YARN and MapReduce 2’s improvements separate cluster resource management capabilities from MapReduce-specific logic. YARN enables Hadoop to share resources dynamically between multiple parallel processing frameworks such as Cloudera Impala, allows more sensible and finer-grained resource configuration for better cluster utilization, and scales Hadoop to accommodate more and larger jobs.
Welcome to Introduction to YARN and MR2. This course is for people who are familiar with Hadoop and MapReduce and want to learn about the new MapReduce 2 architecture. We’re going to talk about the challenges MapReduce 1 has, and why MapReduce 2 and YARN are needed to address those issues. We will talk about the major differences between MR1 and MR2. Then we’ll take a look at how YARN works to manage resources in a cluster…and how a MapReduce job running on YARN actually works. We will take a quick look at how to manage a YARN cluster, and how Cloudera Enterprise users can start moving to the new MR2 platform for their enterprise data hub.
We’ll start with an overview comparison of MR1 and MR2.
If you are a current Hadoop user, MapReduce 1 is what you’ve been using all along – “classic” MapReduce, or MR1. You can think of MR1 as consisting of three main components. First, the API – this is used by programmers to write MapReduce programs…things like mappers, reducers, combiners and all the code that supports MapReduce. Second, the framework – this is the MapReduce “plumbing”…it divides a job into map and reduce tasks and runs those tasks, including calling the input formats, shuffling and sorting the intermediate data and so on. And third, resource management – this is the infrastructure where jobs are submitted to a JobTracker, which then distributes the work to TaskTrackers running on nodes in the cluster, monitors the nodes and the progress of the jobs and so on. MR2 is often called “next generation” or “nextgen” MapReduce, and it is built on top of YARN – which stands for Yet Another Resource Negotiator. YARN takes over the resource management part of MapReduce and has its own API…MapReduce itself now just provides the MapReduce API and framework.
This new architecture was first developed at Yahoo in 2008, and early versions have been available for some time. In version 4 of CDH – Cloudera’s distribution including Apache Hadoop – MR2 and YARN were included as a technology preview, but they weren’t considered ready for prime time yet. This past summer, YARN was promoted to an official sub-project in Apache Hadoop 2. And in October, Hadoop 2 GA was released; as of that release, YARN and MR2 are officially “production ready”. The first beta release of CDH 5 includes this production-ready version.
Now let’s take a look at how this new architecture works, starting with YARN itself.
We’ll start with why we needed YARN in the first place. In MR1, tasks are assigned to nodes according to “slots”, which are either mapper slots or reducer slots. So if a node is configured with 3 mapper slots and 3 reducer slots, and three mappers are already running on that node, no more can run…even if no reducers are running at all and the machine has plenty of CPU power and memory, because the reducer slots aren’t currently running any reducers. Another issue is that other types of applications can run on Hadoop clusters, not just MapReduce. For instance, Impala is a query engine that accesses HDFS directly, not through MapReduce, and Apache Giraph is a graph computation framework. In MR1, there’s no way for these “competing” applications that may be running on the same cluster to know about each other and share resources. Finally, the fact that the JobTracker keeps track of all the tasks on all the nodes limits the size of an MR1 cluster to about 4,000 nodes.
So how does YARN overcome these issues? For starters, with YARN, nodes are no longer configured with specific “slots” for mappers and reducers. Instead, nodes have “resources” – that is, memory and CPU – that are allocated to applications when needed. This means better cluster utilization – unused resources are available for any type of task. Another feature of YARN is that it is a generic platform for resource sharing in a cluster, and isn’t specific to MapReduce. This means other types of applications can run on the same cluster, and YARN will manage resources between all of them. And YARN increases the scalability of Hadoop. In MR1, the JobTracker was a bottleneck. YARN replaces the single JobTracker with two new types of components: a Resource Manager and Application Masters. The Resource Manager has much less responsibility and is therefore not nearly as much of a bottleneck. Let’s take a look at those.
Instead of a JobTracker and TaskTrackers, YARN has a Resource Manager and Node Managers. The Resource Manager runs on a master node, one per cluster. It’s in charge of keeping track of the nodes in the cluster and their resources, and scheduling jobs according to the resources they require. Node Managers are daemons running on each slave node in the cluster. They communicate with the Resource Manager, and run tasks on the node.
When an application runs on YARN, it requests “containers”, which really means a collection of resources…a certain amount of memory and a number of CPU cores. The RM allocates containers on particular nodes, and the application runs in those containers. Containers take the place of the preconfigured “slots” that MR1 uses. The actual application is controlled by an Application Master. Each application – which for MapReduce means each job – has a single app master that requests containers and then runs its tasks in those containers.
So let’s see how this works. This is a basic YARN cluster without any jobs running. The master node runs the RM daemon, and the slave nodes run the NM daemon. Remember that YARN is a generic resource management framework, so we aren’t talking about MapReduce yet – we’ll get to that in a few minutes. This is a concept that applies to any kind of application running on the cluster.
To submit a job, a client passes a “New Application Request” – containing information about the application to run – to the RM. The RM creates a “container” on one of the slave nodes in which to run the Application Master (AM) for that application, and communicates with the Node Manager on that node to launch it. The RM passes an Application ID back to the client.
Once it’s launched and running, the Application Master requests resources from the Resource Manager…that is, it requests “containers” with a certain amount of memory and CPU. The RM allocates the requested number of containers on nodes in the cluster…and passes them to the application.
The Application Master then communicates with the Node Managers on the nodes where the containers are, and instructs them to launch the application’s processes. During execution, the AM communicates with the instances of the app in an application-specific way (not through YARN).
So let’s say one application is running, and now we want to run a second one at the same time.It works the same way: the client tells the resource manager what application to run. The resource manager creates a container on one of the nodes and launches the application master in it.
The second application master requests resources. The Resource Manager knows what resources are available in the cluster, and allocates containers according to resource availability. Then the second application launches whatever it needs in those containers. So that’s a basic overview of how an application runs on a YARN cluster. Let’s take a closer look at a couple of key communication points in this process.
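The bookkeeping we just walked through can be sketched as a toy model. This is purely illustrative Java – not the real ResourceManager code or the org.apache.hadoop.yarn API – but it shows the essential loop: nodes register their resources, container requests are matched against what’s free, and finished containers return their resources to the pool.

```java
import java.util.*;

// Toy model of YARN's allocation loop (illustrative only -- not the real
// ResourceManager API). Each node advertises memory; the "RM" places a
// requested container on the first node with enough free memory.
class ToyResourceManager {
    // node name -> free memory in MB
    private final Map<String, Integer> freeMemory = new LinkedHashMap<>();

    void registerNode(String name, int memoryMb) {
        freeMemory.put(name, memoryMb);           // node manager registers its resources
    }

    // Returns the node the container was placed on, or null if nothing fits
    // (in which case the request waits until resources free up).
    String allocateContainer(int memoryMb) {
        for (Map.Entry<String, Integer> node : freeMemory.entrySet()) {
            if (node.getValue() >= memoryMb) {
                node.setValue(node.getValue() - memoryMb);  // reserve the resources
                return node.getKey();
            }
        }
        return null;
    }

    void releaseContainer(String node, int memoryMb) {
        freeMemory.merge(node, memoryMb, Integer::sum);     // app finished; resources return
    }
}
```

Because both applications go through the same RM, the second app master’s requests are automatically placed wherever capacity remains – which is exactly why two frameworks can share one cluster.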
Now let’s talk about how the Resource Manager actually figures out what resources to allocate, where, and for which app. Just like in MR1, this is done by a scheduler. Different schedulers can be plugged into the Resource Manager to do this. You could create custom schedulers if needed, but two are included, which are very similar to the two used in MR1: the Capacity Scheduler and the Fair Scheduler. If you are familiar with MR1, these function pretty much the same way, except of course they’re more generic, to support any YARN application, not just MapReduce. By far the most common one, and the one Cloudera generally recommends, is the Fair Scheduler. In MR1, the Fair Scheduler worked with “pools” whereas the Capacity Scheduler worked with “queues”, but they did basically the same thing – so to avoid confusion, in YARN both use the same name: “queues”.
One of the big differences between MR1 schedulers and YARN schedulers is support for hierarchical queues. Queues can contain other queues, and share the parent queue’s resources. So let’s say you have two departments using a Hadoop cluster: engineering and marketing. As in MR1, a queue’s “weight” can be set relative to other queues, so in this example engineering gets twice the resources of marketing. Within engineering, they may have two sub-queues for different products, with product A getting more priority than product B. Marketing takes a different approach – they have a queue dedicated to handling their website log ETL process, and another queue for data analysis. Within data analysis, they have a queue for high-priority short-running jobs, and another for regular jobs.
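The weight arithmetic here is simple enough to sketch. This is an illustrative snippet, not the actual FairScheduler implementation; it just shows how a parent queue’s resources divide in proportion to child weights.

```java
import java.util.*;

// Illustrative sketch of fair-scheduler weights (not the real FairScheduler
// code): a parent queue's resources are divided among its children in
// proportion to each child's weight.
class FairShares {
    static Map<String, Integer> shares(Map<String, Integer> weights, int totalMb) {
        int totalWeight = weights.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> q : weights.entrySet()) {
            // each queue's steady-state fair share of the parent's memory
            result.put(q.getKey(), totalMb * q.getValue() / totalWeight);
        }
        return result;
    }
}
```

Applying it twice gives the hierarchical behavior: first split the cluster between engineering (weight 2) and marketing (weight 1), then split engineering’s share again between its product sub-queues.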
So what else does the Resource Manager do, other than running the scheduler? Its tasks include: keeping track of the state of all the nodes in the cluster; allocating containers when applications need them, based on the available resources on nodes – and releasing those containers when they are no longer needed, or when they expire (this is to make sure that applications don’t request resources and then not use them…if a container isn’t used within a certain amount of time, like 10 minutes, the Resource Manager retires it); creating a container for each new Application Master, telling the Node Manager how to launch it, and then tracking all the Application Masters in the cluster; and handling cluster-wide security using Kerberos.
Meanwhile, the Node Manager has different responsibilities. When it starts, it registers with the Resource Manager to report what resources it has available, and it sends heartbeats so that the RM can keep track of it. The Node Manager is also what actually launches and tracks the processes on the node. This includes starting the Application Master itself when the Resource Manager says to, and starting individual processes in containers for Application Masters. The Node Manager also makes sure that processes in containers stay within their allotted resources, and kills processes that use resources out of control. Another thing it does is aggregate logs. If you’ve used MR1, you know that one of the pain points is working with distributed log files. YARN aggregates an application’s logs into an HDFS file which can be accessed by any client. The Node Manager can also run auxiliary services – extra processes a framework might require. We’ll talk soon about how MR2 uses this for shuffle and sort. Finally, Node Managers handle node-level security using ACLs, or access control lists.
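Log aggregation, for example, is controlled by a single switch in yarn-site.xml. (The property name below is the standard Hadoop 2 one; if you use Cloudera Manager it can manage this setting for you.)

```xml
<!-- in yarn-site.xml: have node managers aggregate a finished
     application's logs into a single location in HDFS -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
```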
Let’s look at one of the key points of communication with the Resource Manager: the resource request. As we said, the app master requests whatever resources it needs to run the application from the Resource Manager. In YARN, resources are allocated as containers, so a resource request is essentially a container request. The request includes a resource name, meaning where the container should be located: a specific host, a host in a specific rack, or * for anywhere. It also includes a priority – this is a priority WITHIN the application, not between applications, meaning that a higher-priority container’s resources will be allocated first. And of course, the request includes the amount of resources required. In earlier versions of YARN, that just meant how much memory; now YARN supports the ability to allocate CPU cores as well. There’s work going on to add support for additional types of resources such as disk and network I/O, but for now it’s just memory and CPU. The Resource Manager fulfills the request by allocating containers on nodes with the appropriate resources available, and passes the IDs and node locations of those containers back to the app master.
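A resource request, then, boils down to four pieces of information. Here’s a toy sketch in Java – illustrative only, the real request classes live in org.apache.hadoop.yarn.api.records – including the locality check implied by the resource name:

```java
// Toy model of a YARN resource request (not the real API classes).
// A request names a specific host, a rack, or "*" for anywhere.
class ToyRequest {
    final String resourceName;   // e.g. "node1", "/rack1", or "*"
    final int priority;          // ordering WITHIN this application
    final int memoryMb;          // amount of memory requested
    final int vcores;            // CPU cores requested

    ToyRequest(String resourceName, int priority, int memoryMb, int vcores) {
        this.resourceName = resourceName;
        this.priority = priority;
        this.memoryMb = memoryMb;
        this.vcores = vcores;
    }

    // Can this request be satisfied on the given host in the given rack?
    boolean matches(String host, String rack) {
        return resourceName.equals("*")       // anywhere
            || resourceName.equals(host)      // node-local
            || resourceName.equals(rack);     // rack-local
    }
}
```

A map task would typically name the host holding its HDFS block, while a reducer uses “*” – we’ll see exactly that later in the MapReduce walkthrough.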
Once it’s got that ID and location, the next key communication happens, which is between the Application Master and the Node Manager. This is called a Container Launch Context. This is how the app master tells the Node Manager how to start up the application within the container…it includes the container ID, the actual commands to start the application, the environment information – that is, the configuration – and any local resources required…such as the binary file containing the actual code for the application, references to HDFS files it will be working with, and so on.
Okay, so now we’ve covered how YARN works at a conceptual level, so we’ll move on to talking about MapReduce as one specific type of application that can run on YARN. But remember that YARN is a generic framework and supports a growing number of other types of applications. We won’t get into those in detail in this course, but you should be aware of them. One is DistributedShell. This is a very basic app which is included in Hadoop – it allows running a shell script on multiple nodes. Useful, but mostly it’s a proof of concept for YARN. There are other applications too, like Impala, which does near-real-time queries on data in HDFS using its own query engine rather than MapReduce…or Apache Giraph, which is a distributed engine for graph computation and runs on a Hadoop cluster. Spark is an alternative to MapReduce for stream processing. You can see the list of other applications on the Apache website.
Okay now let’s turn our attention to MapReduce itself.
As we said, YARN is a generic resource management framework. MapReduce 2 is the new version of MapReduce…unlike MR1, it relies on YARN to do the underlying resource management. What’s left is the MapReduce API we already know and love, and the framework for running MapReduce applications. In MapReduce 2, each job is a new “application” from the YARN perspective, so each job has its own Application Master. MR2 includes a component for that called MRAppMaster.
So let’s look at how a MapReduce application works in the context of YARN. Here’s a typical MR cluster. The HDFS setup is the same – either a single NameNode plus a Secondary NameNode, or two NameNodes in HA mode. MR2 does not change HDFS at all. The MR Job History Server is an MR-specific daemon that’s necessary if you are running MR2, because the Resource Manager does not store job history. We will discuss this more later.
As always, MR continues to work on data in HDFS. Imagine we have an HDFS data file that is split into two blocks which are stored on nodes 1 and 2. (We aren’t showing replication here, but the principle would be the same…the blocks would just exist on multiple nodes.)
Let’s look at how the YARN job lifecycle applies to running an MR application like WordCount on that HDFS data file. The programs and the command on the client are exactly the same whether using MR1 or MR2 – nothing about the API or command line interface changes. So in this example we use the familiar hadoop jar command. At that point, the Hadoop client will check the spec of the job, make sure the folders are there, and so on. It will then compute splits. In this example, there are two blocks, therefore two input splits. (By the way, in MR2 this is done by default on the client, but you can actually configure split computation to run on the cluster instead.) The client then copies the job information to HDFS – the application jar file, configuration, and split information, plus anything for the distributed cache. Finally, it submits the job via the YARN Client Protocol’s submitApplication() call, which goes to the Resource Manager. When the RM gets the submitApplication call, it will tell the scheduler to allocate a container for the Application Master, and tell the Node Manager to launch the new container to run the Application Master – which for MapReduce is MRAppMaster.
MRAppMaster then starts. First it will initialize the job – retrieve the configuration and input split information that the Hadoop client placed in HDFS, and set up output directories in HDFS. Then it will set up the tasks required for the job. We already know there are two input splits, and therefore two map tasks. Let’s say this particular job is also set to require two reducers. So the app master will set up 4 tasks – two mappers and two reducers. The app master also decides how to run the tasks. For small jobs, it may run them locally – in the Application Master’s own JVM. This is an “uber task”, for when the overhead of distribution isn’t worth it. What defines a “small” job is configurable; by default it’s a job with fewer than 10 mappers, 1 reducer, and a total input size of less than 1 HDFS block. Uberization is disabled by default, so if you want to take advantage of this you’ll have to enable it. For larger jobs, MRAppMaster distributes the tasks to containers. This will be the case for most jobs. So the AM will request a container for each task, starting with the two map tasks. The default is 1 GB of RAM and one CPU core for each task, but this can be configured on a per-job basis. The request also includes the nodes where the HDFS data is stored, so that the map tasks can run locally if possible, as in MR1.
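The uber decision described above is easy to sketch. This is a simplified illustration, not the real MRAppMaster check (which also consults memory limits); the thresholds match the defaults from the narration – fewer than 10 mappers, at most 1 reducer, input under one HDFS block, and the whole feature off unless explicitly enabled (mapreduce.job.ubertask.enable).

```java
// Simplified sketch of the MRAppMaster "uber task" decision (illustrative;
// the real check also considers the AM's memory limits).
class UberDecision {
    static boolean runAsUber(boolean uberEnabled, int maps, int reduces,
                             long inputBytes, long blockSizeBytes) {
        return uberEnabled                   // disabled by default
            && maps < 10                     // mapreduce.job.ubertask.maxmaps (default 9)
            && reduces <= 1                  // mapreduce.job.ubertask.maxreduces (default 1)
            && inputBytes < blockSizeBytes;  // input smaller than one HDFS block
    }
}
```

For our WordCount example – two maps, two reducers – the job is distributed to containers even if uberization were enabled, because it has more than one reducer.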
The resource manager will allocate the number of containers requested…if possible, on the nodes that were requested or at least in the same rack. The RM will pass the list of container IDs back to the app master
MRAppMaster will launch the tasks in the containers by contacting the Node Managers and passing a launch context. The Node Manager will start a new JVM. (JVM re-use is not supported in YARN.) While running, each task communicates its status back to the app master, which aggregates the status of all the tasks. The client polls the app master (once per second) for the aggregate status updates. And as we will see later, MRAppMaster provides a web UI, so we can access the status information there too.
When the reduce tasks are ready to run, the app master requests a container for each. In our example we happen to have set it to 2 reducers, so we request two containers. In this case, the app master sets the requested node name to ‘*’, indicating that we don’t care which nodes it runs on, because data locality doesn’t apply to reducers, only mappers.
Just as with the map tasks, the resource manager passes IDs with the requested containers.
And the app master will request that the Node Managers launch the reduce tasks.
When the reducers are done, the app master notifies the Resource Manager that the job – that is, the application – is complete. The RM then decommissions the application’s containers (both the task containers and the app master container) and releases the resources to be scheduled for another job.
You might be wondering how the MapReduce “shuffle” process works in MR2, because in MR1 this is managed by the JobTracker and TaskTrackers, which don’t exist in MR2. The answer is what we mentioned earlier: it runs as an auxiliary service in the Node Manager’s JVM. So, as in MR1, a map task reads its data from an HDFS file, and then writes its output to an intermediate file stored locally on the node’s hard disk. When the reducers start up, they need access to that intermediate data, so they request the data from the shuffle service, which is a persistent process running in each Node Manager JVM. They know how to find the service because that information was passed to them by MRAppMaster when they started. The shuffle service passes the requested intermediate data to the reducers. And of course, the reducers write their output to HDFS files just as they did in MR1.
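On a hand-configured cluster, this shuffle auxiliary service is what the aux-services setting in yarn-site.xml enables. (These are the standard Hadoop 2 property names; Cloudera Manager sets them for you.)

```xml
<!-- in yarn-site.xml: register MapReduce's shuffle handler as an
     auxiliary service inside each node manager's JVM -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```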
So now we know how things are *supposed* to happen when a MapReduce job runs on YARN. But what if something goes wrong? That’s where we look at YARN’s fault tolerance. Any of the various components in the lifecycle might fail or become unavailable. First, an individual task might fail – that is, a process running on a node in a container. Handling this is specific to MapReduce, not YARN in general, so it’s the job of the Application Master, which handles it the same way the JobTracker did in MR1: the app master will attempt to re-run the task 4 times, or however many times we’ve configured. We can also configure a certain number of tasks that are allowed to fail; if too many fail, then the whole application is considered to have “failed”.
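That retry-and-threshold policy can be sketched in a few lines of Java. This is a toy illustration of the logic, not the actual MRAppMaster code; the property names in the comments (mapreduce.map.maxattempts and mapreduce.map.failures.maxpercent) are the standard MR2 ones.

```java
// Toy sketch of the MR2 task-failure policy (illustrative only): a task
// gets up to maxAttempts tries, and the job fails if the fraction of
// failed tasks exceeds the allowed percentage.
class ToyFailurePolicy {
    final int maxAttempts;        // mapreduce.map.maxattempts / mapreduce.reduce.maxattempts (default 4)
    final int allowedFailurePct;  // mapreduce.map.failures.maxpercent (default 0)

    ToyFailurePolicy(int maxAttempts, int allowedFailurePct) {
        this.maxAttempts = maxAttempts;
        this.allowedFailurePct = allowedFailurePct;
    }

    // True once the app master should give up re-running this task.
    boolean taskFailed(int attemptsSoFar) {
        return attemptsSoFar >= maxAttempts;
    }

    // True if too many tasks have failed for the job to succeed.
    boolean jobFailed(int failedTasks, int totalTasks) {
        return failedTasks * 100 > allowedFailurePct * totalTasks;
    }
}
```

With the defaults, a single task that exhausts its four attempts fails the whole job; raising the failure percentage lets a job tolerate a few bad tasks.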
What if the Application Master itself fails? Remember that the Resource Manager is keeping track of Application Master heartbeats. If those heartbeats stop, the Resource Manager will restart the whole application, starting a new Application Master, which in turn requests new containers to run its tasks. One optional setting in MRAppMaster is job recovery. If this is enabled, then during processing the app master saves the progress of the various tasks. If the app restarts, it will retrieve that information and only re-run the tasks that were incomplete. For tasks that had already run, the output data that was already saved will be used.
If a node in the cluster fails, the Resource Manager will notice, because it’s tracking heartbeats from the Node Manager daemon, and will stop allocating containers on that node until it’s back up. If any application processes were running on that node, it’s the job of the Application Master to decide how to handle that. As we just mentioned, the MapReduce app master will restart those tasks on another node. What if the node that fails is running the Application Master itself? In that case, the Resource Manager will treat it as a failed application, as just discussed, and attempt to restart it. Finally, the Resource Manager itself can fail. There’s only a single RM active on the cluster at any time – without an RM, no applications or tasks can be launched, and the cluster is essentially unavailable. YARN has a high availability option, which lets you configure a standby Resource Manager that will become active if the active one goes down.
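When that high availability option is configured by hand, it’s driven by a few yarn-site.xml properties along these lines. (This is a sketch, not a complete HA configuration – the exact set of required properties depends on your Hadoop version and on how failover is coordinated.)

```xml
<!-- in yarn-site.xml: enable an active/standby pair of resource managers -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <!-- logical IDs for the two RMs; each ID then gets its own address properties -->
  <value>rm1,rm2</value>
</property>
```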
Now let’s talk about a few things to know as you move toward using MapReduce 2, starting with how to manage a YARN cluster.
The JobTracker and TaskTracker UIs in MR1 have been replaced in MR2 with the Resource Manager and Node Manager UIs, which have very similar functionality. This is the main Resource Manager screen, where you can get general information about the cluster. The overview area shows how many jobs have been submitted, how many are running, and so on – both for the cluster as a whole and for the current user. We can also look specifically at nodes, applications, or the scheduler. Selecting nodes shows us a list of all the nodes in the cluster – in this example we just have two, because it’s a demo, but obviously real clusters will have more. We can get more details about a specific node by clicking on the node name.
We can also look at a list of all the applications, filtered by state – just submitted, currently running, failed, and so on. If we click the application ID…
We can see details about that application like who started it and how long it’s been running. This information is provided by YARN, but remember that YARN doesn’t know what kind of application it is, so it can’t give us any more detail than that. But we can click on the Application Master link…
To view information from the Application Master itself. This UI is provided by the MapReduce app master. It gives us details about the MapReduce job, similar to what we would see in the JobTracker UI in MR1. Remember that one of the main points of YARN is that all the application-specific framework code is moved out of the JobTracker and into the app master – so this UI is provided by the app master itself. It shows us this job has 4 mappers, which are currently about halfway done, and 1 reducer, which hasn’t started yet. If we click on the job ID…
we can get details about the MapReduce job…details about the tasks, counters, configuration…and the application’s logs. As we mentioned before, YARN aggregates the logs for an application into a single HDFS location.
One thing to note is that the Resource Manager does not keep a history of applications that have already run. Applications “expire” after a configurable period of time, and are then no longer available through the Resource Manager UI. But individual application frameworks can keep their own history, and MapReduce 2 does so by adding a Job History Server daemon to the cluster – we mentioned this briefly earlier. It keeps track of all the jobs that have run, and you can see the history by going to the Job History UI. Here you can browse or search through completed jobs and view their metadata, how long they ran, and so on.
So, those are the cluster monitoring UIs provided natively with YARN and MR2. If you are using Cloudera Manager, of course, you can get a lot more insight into the cluster, and configure and manage it from a single location. The latest version of Cloudera Manager includes full support for managing YARN and MR2. For instance, you can configure the YARN daemons and all the various properties, like how many resources are requested by map and reduce tasks…or set up and monitor resource pools – how much memory and how many CPU cores are allocated to applications.
Hue has also been updated to support MR2. You can browse current and recent applications, or check the “show retired jobs” option to connect to the Job History Server and view retired jobs.
The last topic is about how MR2 and Cloudera fit together.
CDH is Cloudera’s distribution including Apache Hadoop. If you are using CDH 4, you already have MR2, because we included early versions of it in CDH 4 starting in 2012. But note that those early versions were not production ready. They were included in CDH as a technology preview so that developers and administrators could start working with the new technology; MR2 in CDH 4 is not recommended for use in production systems. As we mentioned earlier, a production-ready version of MR2 and YARN was released recently, in October, so starting with CDH 5, Cloudera recommends you use MR2. But MR1 is still included and supported as well, so you can make the transition in whatever way makes sense for your organization. Note, though, that you can’t run both on the same cluster: it’s either an MR1 cluster or a YARN cluster, not both.
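Which framework a client submits jobs to is controlled by a single property in mapred-site.xml. (This is the standard Hadoop 2 property name; on an MR1 cluster it is unset or set to its classic value instead.)

```xml
<!-- in mapred-site.xml: submit jobs to YARN (MR2) rather than a JobTracker -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```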
When you are ready to move, you’ll be happy to know that we’ve made sure that programs built for any version of CDH and MapReduce will run with any other version. This means that your MapReduce programs can run without recompilation, as is, whether you are moving to MR2 in CDH 4 or CDH 5. Now, what if you need to do further development, or update your code for unrelated reasons? You’ll need to recompile then, and in 99% of cases all versions are source compatible as well, so you won’t need to make any code changes related to migration. There are a couple of minor instances of source incompatibility, which are listed on our blog; if you happen to use those bits of code, you’ll need to modify them before recompiling, but that shouldn’t affect most users.
Now what about the rest of the Hadoop ecosystem, beyond just MapReduce? Well, in Cloudera Enterprise and Cloudera Standard 4, as we mentioned, CDH included a preview version of MR2, and likewise Cloudera Manager also included preliminary support for MR2. In Cloudera 5, Cloudera Manager has full support for MR2, as we mentioned, as do all the various ecosystem components that depend on MapReduce – such as Hive, Pig, Mahout, and so on. All MapReduce applications are tested and qualified to run on either MR1 or MR2. We’ve also included preliminary support for Impala running on YARN. Remember that one of the advantages of YARN is that applications which don’t use MapReduce but run on a Hadoop cluster can share resources – and Impala is one of those applications.
So that’s YARN and MapReduce 2 in a nutshell.
To re-iterate the key take-away points: MapReduce 2 is the next generation of MapReduce. In MR2, the resource management responsibilities in a cluster are handled by YARN. Instead of using a JobTracker to handle both resource allocation and job management, a single Resource Manager handles resources and scheduling, while a framework-specific Application Master handles all the actual application management. So each MapReduce 2 job has its own MRAppMaster, which handles task assignment and tracking. The advantage of this architecture is that it is more scalable, because it lightens the load on the master node, so Hadoop clusters are no longer limited to 4,000 nodes. It’s more efficient, because resources are allocated according to memory and processing needs as they arise, rather than by designating a certain number of map or reduce slots on each node. And it’s more flexible, because it allows sharing resources between different types of applications running on the same cluster. If you are a Hadoop user or developer, all these changes are pretty much invisible…they “just work”. You will still use the same MapReduce API, the same hadoop command line interface, and the web-based UIs are very similar. And finally, after several years of evolution, MapReduce 2 is ready for prime time. You can start getting familiar with it in CDH 4, and get CDH 5 when you’re ready to move to MapReduce 2 for your production environment.
I hope this presentation has been helpful to you…and if you want to learn more, here are some good places to start. First, check out Chapter 6 in Hadoop: The Definitive Guide by Cloudera’s own Tom White, which focuses on the architecture of MapReduce on top of YARN. And there are several valuable blog posts on the topic on Cloudera’s blog that will help guide you as you start moving to MapReduce 2 and YARN.