YARN
Yet Another Resource Negotiator
CC BY 2.0 / Richard Bumgardner
Been there, done that.
  
Agenda
•  Why YARN?
•  YARN Architecture and Concepts
•  Resources & Scheduling
   –  Capacity Scheduler
   –  Fair Scheduler
•  Configuring the Fair Scheduler
•  Managing Running Jobs
  
Agenda
•  Why YARN?
•  YARN Architecture and Concepts
•  Resources & Scheduling
   –  Capacity Scheduler
   –  Fair Scheduler
•  Configuring the Fair Scheduler
•  Managing Running Jobs
  
The 1st Generation of Hadoop: Batch
HADOOP 1.0: Built for Web-Scale Batch Apps
•  All other usage patterns must leverage that same infrastructure
•  Forces the creation of silos for managing mixed workloads
[Diagram: each workload runs as a single app on its own silo (BATCH on HDFS, INTERACTIVE on HDFS, another BATCH on HDFS, ONLINE on HDFS).]
Hadoop MapReduce Classic
•  JobTracker
   –  Manages cluster resources and job scheduling
•  TaskTracker
   –  Per-node agent
   –  Manages tasks
  
MapReduce Classic: Limitations
•  Scalability
   –  Maximum cluster size – 4,000 nodes
   –  Maximum concurrent tasks – 40,000
   –  Coarse synchronization in JobTracker
•  Availability
   –  Failure kills all queued and running jobs
•  Hard partition of resources into map and reduce slots
   –  Low resource utilization
•  Lacks support for alternate paradigms and services
   –  Iterative applications implemented using MapReduce are 10x slower
  
Our Vision: Hadoop as Next-Gen Platform
HADOOP 1.0 (Single Use System: Batch Apps)
•  MapReduce (cluster resource management & data processing)
•  HDFS (redundant, reliable storage)
HADOOP 2.0 (Multi Purpose Platform: Batch, Interactive, Online, Streaming, …)
•  MapReduce and others (data processing)
•  YARN (cluster resource management)
•  HDFS2 (redundant, reliable storage)
  
YARN: Taking Hadoop Beyond Batch
Store ALL DATA in one place… Interact with that data in MULTIPLE WAYS, with predictable performance and quality of service.
Applications run natively IN Hadoop:
•  BATCH (MapReduce)
•  INTERACTIVE (Tez)
•  ONLINE (HBase)
•  STREAMING (Storm, S4, …)
•  GRAPH (Giraph)
•  IN-MEMORY (Spark)
•  HPC MPI (OpenMPI)
•  OTHER (Search, Weave, …)
All run on YARN (Cluster Resource Management) over HDFS2 (Redundant, Reliable Storage).
  
Why YARN / MR2?
•  Scalability
   –  JobTracker kept track of individual tasks and wouldn’t scale
•  Utilization
   –  All slots are equal even if the work is not equal
•  Multi-tenancy
   –  Every framework shouldn’t need to write its own execution engine
   –  All frameworks should share the resources on a cluster
  
Multiple levels of scheduling
•  YARN
   –  Which application (framework) to give resources to?
•  Application (framework – MR etc.)
   –  Which task within the application should use these resources?
  
Agenda
•  Why YARN?
•  YARN Architecture and Concepts
•  Resources & Scheduling
   –  Capacity Scheduler
   –  Fair Scheduler
•  Configuring the Fair Scheduler
•  Managing Running Jobs
  
YARN Concepts
•  Application
   –  An application is a job submitted to the framework
   –  Example – MapReduce job
•  Container
   –  Basic unit of allocation
   –  Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, gpu, etc.)
      •  container_0 = 2GB, 1 CPU
      •  container_1 = 1GB, 6 CPU
   –  Replaces the fixed map/reduce slots
  
YARN Architecture
[Diagram: the ResourceManager (containing the Scheduler and the Applications Manager (AsM)) coordinates many NodeManagers. Application 1 runs AM 1 and containers 1.1–1.3; application 2 runs AM 2 and containers 2.1–2.4; AMs and containers are hosted on NodeManagers across the cluster.]
  
Architecture
•  Resource Manager
   –  Global resource scheduler
   –  Hierarchical queues
•  Node Manager
   –  Per-machine agent
   –  Manages the life-cycle of containers
   –  Container resource monitoring
•  Application Master
   –  Per-application
   –  Manages application scheduling and task execution
   –  E.g. MapReduce Application Master
  
Design Centre
•  Split up the two major functions of the JobTracker
   –  Cluster resource management
   –  Application life-cycle management
•  MapReduce becomes a user-land library
  
YARN Architecture - Walkthrough
[Diagram: the same cluster (ResourceManager with Scheduler; NodeManagers hosting AM 1 with containers 1.1–1.3 and AM 2 with containers 2.1–2.4), plus a client, Client2, submitting to the ResourceManager.]
  
Control Flow: Submit application
Control Flow: Get application updates
Control Flow: AM asking for resources
Control Flow: AM using containers
Execution Modes
•  Local mode
•  Uber mode
  
Container Types
•  DefaultContainerExecutor
   –  Unix process-based executor, using ulimit
•  LinuxContainerExecutor
   –  Linux container-based executor, using cgroups
•  Choose one based on the isolation level you need
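
A minimal yarn-site.xml sketch for switching to the LinuxContainerExecutor. The property name and class are the standard YARN ones; the choice itself is illustrative, and cgroups-based isolation needs additional NodeManager settings that are not shown here.
<!-- yarn-site.xml (sketch) -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
  <!-- default is DefaultContainerExecutor; pick based on the isolation level you need -->
</property>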
  
Agenda
•  Why YARN?
•  YARN Architecture and Concepts
•  Resources & Scheduling
   –  Capacity Scheduler
   –  Fair Scheduler
•  Configuring the Fair Scheduler
•  Managing Running Jobs
  
Resource Model and Capacities
•  Resource vectors
   –  e.g. 1024 MB, 2 vcores, …
   –  No more task slots!
•  Nodes specify the amount of resources they have
   –  yarn.nodemanager.resource.memory-mb
   –  yarn.nodemanager.resource.cpu-vcores
•  vcores map to physical cores – they are not really “virtual”
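
A sketch of the per-node capacity settings named above, in yarn-site.xml. The 8 GB / 8 vcore values are illustrative only and should reflect what the NodeManager's host can actually offer to containers.
<!-- yarn-site.xml (sketch) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>   <!-- illustrative: memory this node offers to containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>      <!-- illustrative: vcores this node offers to containers -->
</property>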
  
Resources and Scheduling
•  What you request is what you get
   –  No more fixed-size slots
   –  Framework/application requests resources for a task
•  MR AM requests resources for map and reduce tasks; these requests can potentially be for different amounts of resources
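
For MapReduce on YARN, those per-task requests come from job configuration. A sketch using the standard mapreduce.* properties follows; the values are illustrative, not recommendations, and simply show that map and reduce tasks can ask for different amounts.
<!-- mapred-site.xml or per-job configuration (sketch) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>   <!-- illustrative: container size requested for each map task -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>   <!-- illustrative: reduce tasks can request a different amount -->
</property>
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>1</value>
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>1</value>
</property>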
  
YARN Scheduling
[Diagram: the ResourceManager, Application Masters 1 and 2, and Nodes 1–3 exchanging resource requests.]
•  AM 1 to the ResourceManager: “I want 2 containers with 1024 MB and 1 core each.” ResourceManager: “Noted.”
•  Node 1 heartbeats to the ResourceManager (“I’m still here”); the ResourceManager decides “I’ll reserve some space on Node 1 for AM 1.”
•  AM 1: “Got anything for me?” ResourceManager: “Here’s a security token to let you launch a container on Node 1.”
•  AM 1 to Node 1: “Hey, launch my container with this shell command.” Node 1 starts the container.
  
YARN Schedulers
•  Same choices as MR1
•  FIFO Scheduler
   –  Processes jobs in order
•  Fair Scheduler
   –  Fair to all users; supports dominant resource fairness
•  Capacity Scheduler
   –  Queue shares as percentages of the cluster
   –  FIFO scheduling within each queue
   –  Supports preemption
•  The default is the Capacity Scheduler
  
Capacity Scheduler
[Diagram: guaranteed resources split across three queues (queue-1: 50%, queue-2: 30%, queue-3: 20%), each holding its own apps.]
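
A capacity-scheduler.xml sketch that would express the 50/30/20 split in the diagram. The queue names are taken from the slide; the flat queue layout under root is an assumption made for illustration.
<!-- capacity-scheduler.xml (sketch) -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>queue-1,queue-2,queue-3</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.queue-1.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.queue-2.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.queue-3.capacity</name>
  <value>20</value>
</property>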
  
YARN Capacity Scheduler
•  Configuration is in capacity-scheduler.xml
•  Take some time to set up your queues!
•  Queues have per-queue ACLs to restrict queue access
   –  Access can be dynamically changed
•  Elasticity can be limited on a per-queue basis
   –  Use yarn.scheduler.capacity.<queue-path>.maximum-capacity
•  Use yarn.scheduler.capacity.<queue-path>.state to drain queues
   –  ‘Decommissioning’ a queue
•  Run yarn rmadmin -refreshQueues to make runtime changes
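
A sketch of the two per-queue properties mentioned above, reusing the queue-1 path from the earlier diagram as an assumed example. STOPPED drains the queue (no new applications are accepted while running apps finish), and edits take effect after yarn rmadmin -refreshQueues.
<!-- capacity-scheduler.xml (sketch) -->
<property>
  <name>yarn.scheduler.capacity.root.queue-1.maximum-capacity</name>
  <value>60</value>      <!-- illustrative: cap elasticity at 60% of the cluster -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.queue-1.state</name>
  <value>STOPPED</value> <!-- RUNNING by default; STOPPED "decommissions" the queue -->
</property>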
  
YARN Fair Scheduler
•  The Fair Scheduler is the default YARN scheduler in CDH5
•  The only YARN scheduler that Cloudera recommends for production clusters
•  Provides fine-grained resource allocation for multiple resource types
   –  Memory (by default)
   –  CPU (optional)
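
If a cluster is not already running it (CDH5 selects it by default), the Fair Scheduler is chosen via the ResourceManager's scheduler class. A minimal yarn-site.xml sketch, using the stock Hadoop class name:
<!-- yarn-site.xml (sketch) -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>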
  
Goals of the Fair Scheduler
•  Should allow short interactive jobs to coexist with long production jobs
•  Should allow resources to be controlled proportionally
•  Should ensure that the cluster is efficiently utilized
  
The Fair Scheduler
•  The Fair Scheduler promotes fairness between schedulable entities
•  The Fair Scheduler awards resources to pools that are most underserved
   –  Gives a container to the pool that has the fewest resources allocated
  
Fair Scheduler Pools
•  Each job is assigned to a pool
   –  Also known as a queue in YARN terminology
•  All pools in YARN descend from the root pool
•  Physical resources are not bound to any specific pool
•  Pools can be predefined or defined dynamically by specifying a pool name when you submit a job
•  Pools and subpools are defined in the fair-scheduler.xml file
[Diagram: a 30GB cluster shared evenly between two pools, Alice and Bob, 15GB each.]
In Which Pool Will a Job Run?
•  The default pool for a job is root.username
   –  For example, root.Alice and root.Bob
   –  You can drop root when referring to a pool
      •  For example, you can refer to root.Alice simply as Alice
•  Jobs can be assigned to arbitrarily-named pools
   –  To specify the pool name when submitting a MapReduce job, use
      •  -D mapreduce.job.queuename
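
Pool placement can also be driven by rules in fair-scheduler.xml. The sketch below shows one common arrangement (use the explicitly requested queue, otherwise the user's own pool, otherwise the default pool); it is an illustration, not the deck's configuration.
<!-- fair-scheduler.xml (sketch) -->
<allocations>
  <queuePlacementPolicy>
    <rule name="specified"/>   <!-- honor -D mapreduce.job.queuename if set -->
    <rule name="user"/>        <!-- otherwise use root.<username> -->
    <rule name="default"/>     <!-- fall back to root.default -->
  </queuePlacementPolicy>
</allocations>
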
When Will a Job Run Within a Pool?
•  The Fair Scheduler grants resources to a pool, but which job’s task will get resources?
•  The policies for assigning resources to jobs within a pool are defined in fair-scheduler.xml
•  The Fair Scheduler uses three techniques for prioritizing jobs within pools:
   –  Single resource fairness
   –  Dominant resource fairness
   –  FIFO
•  You can also configure the Fair Scheduler to delay assignment of resources when a preferred rack or node is not available
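
The per-pool policy is set with the <schedulingPolicy> element of fair-scheduler.xml. A sketch with assumed pool names, one pool per policy, purely for illustration:
<!-- fair-scheduler.xml (sketch) -->
<allocations>
  <queue name="analytics">
    <schedulingPolicy>fair</schedulingPolicy>   <!-- single resource fairness (memory) -->
  </queue>
  <queue name="etl">
    <schedulingPolicy>drf</schedulingPolicy>    <!-- dominant resource fairness (memory + CPU) -->
  </queue>
  <queue name="adhoc">
    <schedulingPolicy>fifo</schedulingPolicy>   <!-- first-in, first-out within the pool -->
  </queue>
</allocations>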
  
Single Resource Fairness
•  Single resource fairness
   –  Is the default Fair Scheduler policy
   –  Schedules jobs using memory
•  Example
   –  Two pools: Alice has 15GB allocated, and Bob has 5GB
   –  Both pools request a 10GB container of memory
   –  Bob has fewer resources and will be granted the next 10GB that becomes available
[Diagram: a 30GB cluster; Alice holds 15GB, Bob holds 5GB, and the next 10GB goes to Bob.]
Adding Pools Redistributes Resources
•  The user Charlie now submits a job to a new pool
   –  Resource allocations are adjusted
   –  Each pool receives a fair share of cluster resources
[Diagram: the 30GB cluster is now split evenly: Alice 10GB, Bob 10GB, Charlie 10GB.]
Determining the Fair Share
•  The fair share of resources assigned to the pool is based on
   –  The total resources available across the cluster
   –  The number of pools competing for cluster resources
•  Excess cluster capacity is spread across all pools
   –  The aim is to maintain the most even allocation possible so every pool receives its fair share of resources
•  The fair share will never be higher than the actual demand
•  Pools can use more than their fair share when other pools are not in need of resources
   –  This happens when there are no tasks eligible to run in other pools
  
Minimum Resources
•  A pool with minimum resources defined receives priority during resource allocation
•  The minimum resources, minResources, are the minimum amount of resources that must be allocated to the pool prior to fair share allocation
   –  Minimum resources are allocated to each pool assuming there is cluster capacity
   –  Pools that have minimum resources specified will receive priority in resource assignment
  
Minimum Resource Allocation Example
•  First, fill up the Production pool to the 20GB minimum guarantee
•  Then distribute the remaining 10GB evenly across Alice and Bob
[Diagram: 30GB cluster; Production (minResources: 20GB, demand: 100GB) is allocated 20GB; Alice and Bob (demands of 30GB and 25GB) get 5GB each.]
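
The corresponding pool definition would look roughly like the sketch below. The pool name comes from the slide, 20GB is written as 20480 mb in the Fair Scheduler's "x mb, y vcores" format, and the vcore figure is an assumption since the example only discusses memory.
<!-- fair-scheduler.xml (sketch) -->
<allocations>
  <queue name="production">
    <minResources>20480 mb, 0 vcores</minResources>  <!-- 20GB minimum guarantee -->
  </queue>
</allocations>
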
Minimum Resource Allocation Example 2: Production Pool Empty
•  Production has no demand, so no resources are allocated to it
•  All resources are allocated evenly between Alice and Bob
[Diagram: 30GB cluster; Production (minResources: 20GB, demand: 0GB) gets nothing; Alice and Bob (demands of 30GB and 25GB) get 15GB each.]
Minimum Resource Allocation Example 3: MinResources Exceed Resources
•  The combined minResources of Production and Research exceed capacity
•  Minimum resources are assigned proportionally based on the defined minResources until available resources are exhausted
•  No memory remains for pools without minResources defined (i.e., Bob)
[Diagram: 30GB cluster with demands of 100GB (Production), 30GB and 25GB (the other two pools); Production (minResources: 50GB) receives 20GB, Research (minResources: 25GB) receives 10GB, and Bob receives nothing.]
Minimum Resource Allocation Example 4: MinResources < Fair Share
•  Production is filled to its minResources
•  The remaining 25GB is distributed across all pools
•  The Production pool receives more than its minResources, to maintain fairness
[Diagram: 30GB cluster; Production (minResources: 5GB, demand: 100GB), Alice and Bob (demands of 30GB and 25GB) each receive 10GB.]
Pools with Weights
•  Instead of (or in addition to) setting minResources, pools can be assigned a weight
•  Pools with higher weight receive more resources during allocation
•  ‘Even water glass height’ analogy:
   –  Think of the weight as controlling the ‘width’ of the glass
  
Example: Pool with Double Weight
•  Production is filled to its minResources (5GB)
•  The remaining 25GB is distributed across all pools
•  The Bob pool receives twice the amount of memory during fair share allocation
[Diagram: 30GB cluster; Production (minResources: 5GB, demand: 100GB) gets 8GB, Alice gets 8GB, and Bob (weight: 2) gets 14GB; the other demands are 30GB and 25GB.]
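
A fair-scheduler.xml fragment matching this example might look like the sketch below. Pool names follow the slide, 5GB is written as 5120 mb, and the vcore figure is an assumption since only memory is discussed.
<!-- fair-scheduler.xml (sketch) -->
<allocations>
  <queue name="production">
    <minResources>5120 mb, 0 vcores</minResources>  <!-- 5GB minimum guarantee -->
  </queue>
  <queue name="bob">
    <weight>2.0</weight>  <!-- receives twice the fair-share memory of an unweighted pool -->
  </queue>
</allocations>
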
Dominant Resource Fairness
•  The Fair Scheduler can be configured to schedule with both memory and CPU using dominant resource fairness
•  Scenario #1:
   –  Alice has 6GB and 3 cores, and Bob has 4GB and 2 cores – which pool receives the next resource allocation?
•  Bob will receive the next container because it has less memory and fewer CPU cores allocated than Alice
[Diagram: Alice usage 6GB / 3 cores; Bob usage 4GB / 2 cores.]
Dominant Resource Fairness Example
•  Scenario #2:
   –  A cluster has 10GB of total memory and 20 cores
   –  Pool Alice has containers granted for 4GB of memory and 5 cores
   –  Pool Bob has containers granted for 1GB of memory and 10 cores
•  Alice will receive the next container because its 40% dominant share of memory is less than the Bob pool’s 50% dominant share of CPU
[Diagram: Alice usage 4GB (40% of capacity) and 5 cores (25%); Bob usage 1GB (10%) and 10 cores (50%).]
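
To make DRF the policy for every pool at once, the allocation file supports a cluster-wide default. A minimal sketch; the element name is standard, and placing it alone here is purely for illustration.
<!-- fair-scheduler.xml (sketch) -->
<allocations>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
</allocations>
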
Achieving Fair Share: The Patient Approach
•  If shares are imbalanced, pools which are over their fair share may not be assigned new tasks when their old ones complete
   –  Those resources then become available to pools which are operating below their fair share
•  However, waiting patiently for a task in another pool to finish may not be acceptable in a production environment
   –  Tasks could take a long time to complete
  
Achieving Fair Share: The Brute Force Approach
•  With preemption enabled, the Fair Scheduler actively kills tasks that belong to pools operating over their fair share
   –  Pools operating below fair share receive those reaped resources
•  There are two types of preemption available
   –  Minimum share preemption
   –  Fair share preemption
•  The preemption code avoids killing a task in a pool if it would cause that pool to begin preempting tasks in other pools
   –  This prevents a potentially endless cycle of pools killing one another’s tasks
  
Minimum Share Preemption
•  Pools with a minResources configured are operating on an SLA (Service Level Agreement)
•  Pools that are below their minimum share as defined by minResources can preempt tasks in other pools
   –  Set minSharePreemptionTimeout to the number of seconds the pool is under its minimum share before preemption should begin
   –  Default is infinite (Java’s Long.MAX_VALUE)
  
Fair Share Preemption
•  Pools not receiving their fair share can preempt tasks in other pools
   –  Only pools that exceed their fair share are candidates for preemption
•  Use fair share preemption conservatively
   –  Set fairSharePreemptionTimeout to the number of seconds a pool is under fair share before preemption should begin
   –  Default is infinite (Java’s Long.MAX_VALUE)
  
Agenda
•  Why YARN?
•  YARN Architecture and Concepts
•  Resources & Scheduling
   –  Capacity Scheduler
   –  Fair Scheduler
•  Configuring the Fair Scheduler
•  Managing Running Jobs
  
Configuring Fair Scheduler Capabilities (1)
•  yarn.scheduler.fair.allow-undeclared-pools (yarn-site.xml)
   –  When true, new pools can be created at application submission time or by the user-as-default-queue property. When false, submitting to a pool that is not specified in the fair-scheduler.xml file causes the application to be placed in the “default” pool. Default: true. Ignored if a pool placement policy is defined in the fair-scheduler.xml file.
•  yarn.scheduler.fair.preemption (yarn-site.xml)
   –  Enables preemption in the Fair Scheduler. Set to true if you have pools that must operate on an SLA. Default: false.
•  yarn.scheduler.fair.user-as-default-queue (yarn-site.xml)
   –  Send jobs to pools based on users’ names instead of to the default pool, root.default. Default: true.
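
A yarn-site.xml sketch combining the three properties above. The values shown are one plausible production-leaning choice, not a recommendation from the deck.
<!-- yarn-site.xml (sketch) -->
<property>
  <name>yarn.scheduler.fair.allow-undeclared-pools</name>
  <value>false</value>  <!-- only pools declared in fair-scheduler.xml may be used -->
</property>
<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>   <!-- needed if any pool has an SLA enforced by preemption -->
</property>
<property>
  <name>yarn.scheduler.fair.user-as-default-queue</name>
  <value>true</value>   <!-- jobs without a pool go to root.<username> -->
</property>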
  
Configuring Fair Scheduler Capabilities (2)
•  yarn.scheduler.fair.locality.threshold.node and yarn.scheduler.fair.locality.threshold.rack (yarn-site.xml)
   –  For applications that request containers on particular nodes or racks, the number of scheduling opportunities since the last container assignment to wait before accepting a placement on another node. Expressed as a float between 0 and 1, which, as a fraction of the cluster size, is the number of scheduling opportunities to pass up. Default: -1 (don’t pass up any scheduling opportunities).
      •  Example: yarn.scheduler.fair.locality.threshold.node = 0.02, cluster size = 100 nodes. At most 2 scheduling opportunities can be skipped when the preferred placement cannot be met.
  
Configuring Resource Allocation for Pools and Users (1)
•  You configure Fair Scheduler pools in the /etc/hadoop/conf/fair-scheduler.xml file
•  The Fair Scheduler rereads this file every 10 seconds
   –  A ResourceManager restart is not required when the file changes
•  The fair-scheduler.xml file must contain an <allocations> element
•  Use the <queue> element to configure resource allocation for a pool
•  Use the <user> element to configure resource allocation for a user across multiple pools
  
Configuring Resource Allocation for Pools and Users (2)
•  To specify resource allocations, use the <queue> or <user> element with any or all of the following subelements
   –  <minResources>
      •  The minimum resources to which the pool is entitled
      •  Format is x mb, y vcores
      •  Example: 10000 mb, 5 vcores
   –  <maxResources>
      •  The maximum resources to which the pool is entitled
      •  Format is x mb, y vcores
Configuring Resource Allocation for Pools and Users (3)
•  Additional sub-elements of <queue> or <user> to use when specifying resource allocations
   –  <maxRunningApps>
      •  The maximum number of applications in the pool that can run concurrently
   –  <weight>
      •  Used for non-proportionate sharing with other pools
      •  The default is 1
   –  <minSharePreemptionTimeout>
      •  Time to wait before preempting tasks
   –  <schedulingPolicy>
      •  SRF for single resource fairness (the default)
      •  DRF for dominant resource fairness
      •  FIFO for first-in, first-out
  
fair-scheduler.xml Example (1)
•  Allow users to run three jobs, but allow Bob to run six jobs

<?xml version="1.0"?>
<allocations>
    <userMaxAppsDefault>3</userMaxAppsDefault>
    <user name="bob">
        <maxRunningApps>6</maxRunningApps>
    </user>
</allocations>
  
fair-scheduler.xml Example (2)
•  Add a fair share timeout

<?xml version="1.0"?>
<allocations>
    <userMaxAppsDefault>3</userMaxAppsDefault>
    <user name="bob">
        <maxRunningApps>6</maxRunningApps>
    </user>
    <fairSharePreemptionTimeout>300</fairSharePreemptionTimeout>
</allocations>
  
fair-scheduler.xml Example (3)
•  Define the production pool with a weight of 2 and a resource allocation of 10000 MB and 1 core

<?xml version="1.0"?>
<allocations>
    <userMaxAppsDefault>3</userMaxAppsDefault>
    <queue name="production">
        <minResources>10000 mb, 1 vcores</minResources>
        <weight>2.0</weight>
    </queue>
</allocations>
  
fair-scheduler.xml Example (4)
•  Add an SLA to the production pool

<?xml version="1.0"?>
<allocations>
    <userMaxAppsDefault>3</userMaxAppsDefault>
    <queue name="production">
        <minResources>10000 mb, 1 vcores</minResources>
        <weight>2.0</weight>
        <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
    </queue>
</allocations>
  
The Fair Scheduler User Interface
•  http://<resource_manager_host>:8088/cluster/scheduler
  
Agenda
•  Why YARN?
•  YARN Architecture and Concepts
•  Resources & Scheduling
   –  Capacity Scheduler
   –  Fair Scheduler
•  Configuring the Fair Scheduler
•  Managing Running Jobs
  
Displaying Jobs
•  To view jobs currently running on the cluster
   –  yarn application -list
   –  Lists all running jobs, including the application ID for each
•  To view all jobs on the cluster, including completed jobs
   –  yarn application -list -appStates ALL
•  To display the status of an individual job
   –  yarn application -status <application_ID>
•  You can also use the ResourceManager Web UI, Hue, Ambari, or Cloudera Manager to display jobs
  
Killing Jobs
•  It is important to note that once a user has submitted a job, they cannot stop it just by hitting CTRL-C on their terminal
   –  This stops job output appearing on the user’s console
   –  The job is still running on the cluster!
•  To kill a job running on the cluster
   –  yarn application -kill <application_ID>
•  You can also kill a job from Cloudera Manager
  
More Related Content

What's hot

Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARN
Adam Kawa
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Apache Hive Tutorial
Apache Hive TutorialApache Hive Tutorial
Apache Hive Tutorial
Sandeep Patil
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
Amit Raj
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
Arinto Murdopo
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
Arvind Kumar
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
SANG WON PARK
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
Edureka!
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query Optimization
Eyad Garelnabi
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Databricks
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
StampedeCon
 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your Analytics
Yaroslav Tkachenko
 
Yarn
YarnYarn
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
Milind Bhandarkar
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
Knoldus Inc.
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon
 

What's hot (20)

Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARN
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Apache Hive Tutorial
Apache Hive TutorialApache Hive Tutorial
Apache Hive Tutorial
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query Optimization
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your Analytics
 
Yarn
YarnYarn
Yarn
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
 

Viewers also liked

An Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop YarnAn Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop Yarn
Mike Frampton
 
Hadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspectiveHadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspective
Joydeep Sen Sarma
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
Subhas Kumar Ghosh
 
Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Fra...
Dache: A Data Aware Caching for Big-Data Applications Usingthe MapReduce Fra...Dache: A Data Aware Caching for Big-Data Applications Usingthe MapReduce Fra...
Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Fra...
Govt.Engineering college, Idukki
 
Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Ryu Kobayashi
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Hortonworks
 

Viewers also liked (6)

An Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop YarnAn Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop Yarn
 
Hadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspectiveHadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspective
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Fra...
Dache: A Data Aware Caching for Big-Data Applications Usingthe MapReduce Fra...Dache: A Data Aware Caching for Big-Data Applications Usingthe MapReduce Fra...
Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Fra...
 
Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 

Similar to Yarn

Hadoop bangalore-meetup-dec-2011-hadoop nextgen
Hadoop bangalore-meetup-dec-2011-hadoop nextgenHadoop bangalore-meetup-dec-2011-hadoop nextgen
Hadoop bangalore-meetup-dec-2011-hadoop nextgen
InMobi
 
Apache Hadoop MapReduce: What's Next
Apache Hadoop MapReduce: What's NextApache Hadoop MapReduce: What's Next
Apache Hadoop MapReduce: What's Next
DataWorks Summit
 
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHortonworks
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Hortonworks
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
Joseph Niemiec
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Stanley Wang
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Stanley Wang
 
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
Cloudera, Inc.
 
[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史
[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史
[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史
Insight Technology, Inc.
 
Hadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingHadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch training
Nandan Kumar
 
Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014Tsuyoshi OZAWA
 
Yarn Resource Management Using Machine Learning
Yarn Resource Management Using Machine LearningYarn Resource Management Using Machine Learning
Yarn Resource Management Using Machine Learning
ojavajava
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Mac Moore
 
Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014Tsuyoshi OZAWA
 
Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1
Sandeep Kunkunuru
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
Hortonworks
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
Apache Apex
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopHortonworks
 
A sdn based application aware and network provisioning
A sdn based application aware and network provisioningA sdn based application aware and network provisioning
A sdn based application aware and network provisioning
Stanley Wang
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
David Kaiser
 

Similar to Yarn (20)

Hadoop bangalore-meetup-dec-2011-hadoop nextgen
Hadoop bangalore-meetup-dec-2011-hadoop nextgenHadoop bangalore-meetup-dec-2011-hadoop nextgen
Hadoop bangalore-meetup-dec-2011-hadoop nextgen
 
Apache Hadoop MapReduce: What's Next
Apache Hadoop MapReduce: What's NextApache Hadoop MapReduce: What's Next
Apache Hadoop MapReduce: What's Next
 
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
 
[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史
[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から  by NTT 小沢健史
[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史
 
Hadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingHadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch training
 
Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014
 
Yarn Resource Management Using Machine Learning
Yarn Resource Management Using Machine LearningYarn Resource Management Using Machine Learning
Yarn Resource Management Using Machine Learning
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014
 
Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache Hadoop
 
A sdn based application aware and network provisioning
A sdn based application aware and network provisioningA sdn based application aware and network provisioning
A sdn based application aware and network provisioning
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 

Recently uploaded

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
Jen Stirrup
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
Globus
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 

Recently uploaded (20)

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 

Yarn

  • 2. CC BY 2.0 / Richard Bumgardner   Been there, done that.  
  • 3. Agenda •  Why  YARN?   •  YARN  Architecture  and  Concepts   •  Resources  &  Scheduling   –  Capacity  Scheduler   –  Fair  Scheduler   •  Configuring  the  Fair  Scheduler   •  Managing  Running  Jobs  
  • 4. Agenda •  Why  YARN?   •  YARN  Architecture  and  Concepts   •  Resources  &  Scheduling   –  Capacity  Scheduler   –  Fair  Scheduler   •  Configuring  the  Fair  Scheduler   •  Managing  Running  Jobs  
  • 5. The 1st Generation of Hadoop: Batch HADOOP 1.0 Built for Web-Scale Batch Apps Single App   BATCH HDFS Single App   INTERACTIVE Single App   BATCH HDFS •  All other usage patterns must leverage that same infrastructure •  Forces the creation of silos for managing mixed workloads Single App   BATCH HDFS Single App   ONLINE
  • 6. Hadoop MapReduce Classic •  JobTracker   –  Manages  cluster  resources  and  job  scheduling   •  TaskTracker   –  Per-­‐node  agent   –  Manage  tasks  
  • 7. MapReduce Classic: Limitations •  Scalability   –  Maximum  Cluster  size  –  4,000  nodes   –  Maximum  concurrent  tasks  –  40,000   –  Coarse  synchronizaPon  in  JobTracker   •  Availability   –  Failure  kills  all  queued  and  running  jobs   •  Hard  parPPon  of  resources  into  map  and  reduce  slots   –  Low  resource  uPlizaPon   •  Lacks  support  for  alternate  paradigms  and  services   –  IteraPve  applicaPons  implemented  using  MapReduce  are  10x  slower  
  • 8. Our Vision: Hadoop as Next-Gen Platform MapReduce   (cluster resource management   & data processing)   HDFS   (redundant, reliable storage)   Single Use System Batch Apps HADOOP 1.0 Multi Purpose Platform Batch, Interactive, Online, Streaming, … HADOOP 2.0 Others   (data processing)   YARN   (cluster resource management)   HDFS2   (redundant, reliable storage)   MapReduce   (data processing)  
  • 9. YARN: Taking Hadoop Beyond Batch YARN (Cluster Resource Management) HDFS2 (Redundant, Reliable Storage) BATCH (MapReduce) INTERACTIVE (Tez) STREAMING (Storm, S4,…) GRAPH (Giraph) IN-MEMORY (Spark) HPC MPI (OpenMPI) ONLINE (HBase) Store ALL DATA in one place… Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service Applications Run Natively IN Hadoop OTHER (Search) (Weave…)
  • 10. Why YARN / MR2? • Scalability – JobTracker kept track of individual tasks and wouldn't scale • Utilization – All slots are equal even if the work is not equal • Multi-tenancy – Every framework shouldn't need to write its own execution engine – All frameworks should share the resources on a cluster
  • 11. Multiple levels of scheduling • YARN – Which application (framework) to give resources to? • Application (Framework – MR etc.) – Which task within the application should use these resources?
  • 12. Agenda •  Why  YARN?   •  YARN  Architecture  and  Concepts   •  Resources  &  Scheduling   –  Capacity  Scheduler   –  Fair  Scheduler   •  Configuring  the  Fair  Scheduler   •  Managing  Running  Jobs  
  • 13. YARN Concepts • Application – An application is a job submitted to the framework – Example: a MapReduce job • Container – Basic unit of allocation – Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, gpu etc.) • container_0 = 2GB, 1 CPU • container_1 = 1GB, 6 CPU – Replaces the fixed map/reduce slots
  • 14. YARN Architecture [Diagram: a ResourceManager (Scheduler + Applications Manager (AsM)) coordinating many NodeManagers; AM 1 runs Containers 1.1–1.3 and AM 2 runs Containers 2.1–2.4 on the NodeManagers]
  • 15. Architecture • Resource Manager – Global resource scheduler – Hierarchical queues • Node Manager – Per-machine agent – Manages the life-cycle of containers – Container resource monitoring • Application Master – Per-application – Manages application scheduling and task execution – E.g. MapReduce Application Master
  • 16. Design Centre • Split up the two major functions of JobTracker – Cluster resource management – Application life-cycle management • MapReduce becomes a user-land library
  • 17. YARN Architecture - Walkthrough [Diagram: same cluster layout as slide 14, with Client2 interacting with the ResourceManager (Scheduler); AM 1 runs Containers 1.1–1.3 and AM 2 runs Containers 2.1–2.4 on the NodeManagers]
  • 18. Control Flow: Submit application
  • 19. Control Flow: Get application updates
  • 20. Control Flow: AM asking for resources
  • 21. Control Flow: AM using containers
  • 22. Execution Modes •  Local  mode   •  Uber  mode  
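These modes are toggled through MapReduce configuration rather than YARN itself; a minimal mapred-site.xml sketch, assuming the standard MRv2 property names (the values shown are illustrative, not from the original deck):

    <property>
      <name>mapreduce.framework.name</name>
      <value>local</value>  <!-- local mode: run the whole job in a single local JVM instead of on the cluster -->
    </property>
    <property>
      <name>mapreduce.job.ubertask.enable</name>
      <value>true</value>   <!-- uber mode: run sufficiently small jobs inside the ApplicationMaster's own container -->
    </property>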
  • 23. Container Types • DefaultContainerExecutor – Unix process-based executor using ulimit • LinuxContainerExecutor – Linux container-based executor using cgroups • Choose one based on the isolation level you need
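The executor is selected per NodeManager in yarn-site.xml; a hedged sketch using the stock Hadoop 2.x class names (the LinuxContainerExecutor additionally needs the setuid container-executor binary and container-executor.cfg set up, which is omitted here):

    <property>
      <name>yarn.nodemanager.container-executor.class</name>
      <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
    </property>
    <property>
      <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
      <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
    </property>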
  • 24. Agenda •  Why  YARN?   •  YARN  Architecture  and  Concepts   •  Resources  &  Scheduling   –  Capacity  Scheduler   –  Fair  Scheduler   •  Configuring  the  Fair  Scheduler   •  Managing  Running  Jobs  
  • 25.
  • 26. Resource Model and Capacities • Resource vectors – e.g. 1024 MB, 2 vcores, … – No more task slots! • Nodes specify the amount of resources they have – yarn.nodemanager.resource.memory-mb – yarn.nodemanager.resource.cpu-vcores • vcores map to physical cores; they are not really "virtual"
  • 27. Resources and Scheduling • What you request is what you get – No more fixed-size slots – Framework/application requests resources for a task • The MR AM requests resources for map and reduce tasks; these requests can potentially be for different amounts of resources
  • 28. YARN Scheduling [Diagram: ResourceManager, Application Masters 1 and 2, and Nodes 1–3 exchanging messages – "I want 2 containers with 1024 MB and 1 core each" / "Noted" / "I'm still here" / "I'll reserve some space on Node 1 for AM1" / "Got anything for me?" / "Here's a security token to let you launch a container on Node 1" / "Hey, launch my container with this shell command" → Container]
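For example, a worker with 24 GB and 12 cores set aside for containers would advertise itself in yarn-site.xml roughly like this (property names from the slide; the values are illustrative):

    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>24576</value>  <!-- memory this NodeManager offers to containers, in MB -->
    </property>
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>12</value>     <!-- vcores this NodeManager offers to containers -->
    </property>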
  • 29. YARN Schedulers • Same as MR1 • FIFO Scheduler – Processes jobs in order • Fair Scheduler – Fair to all users; supports dominant resource fairness • Capacity Scheduler – Queue shares as a percentage of the cluster – FIFO scheduling within each queue – Supports preemption • Default is the Capacity Scheduler
  • 30. Capacity Scheduler [Diagram: three queues with guaranteed resources – queue-1: 50%, queue-2: 30%, queue-3: 20% – each holding its own apps]
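The active scheduler is chosen on the ResourceManager; a sketch assuming the stock Hadoop class names:

    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <!-- Fair Scheduler; swap in ...scheduler.capacity.CapacityScheduler to keep the default -->
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>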
  • 31. YARN Capacity Scheduler • Configuration in capacity-scheduler.xml • Take some time to set up your queues! • Queues have per-queue ACLs to restrict queue access – Access can be dynamically changed • Elasticity can be limited on a per-queue basis – use yarn.scheduler.capacity.<queue-path>.maximum-capacity • Use yarn.scheduler.capacity.<queue-path>.state to drain queues – 'Decommissioning' a queue • yarn rmadmin -refreshQueues to make runtime changes
  • 32. YARN Fair Scheduler • The Fair Scheduler is the default YARN scheduler in CDH5 • The only YARN scheduler that Cloudera recommends for production clusters • Provides fine-grained resource allocation for multiple resource types – Memory (by default) – CPU (optional)
  • 33. Goals of the Fair Scheduler • Should allow short interactive jobs to coexist with long production jobs • Should allow resources to be controlled proportionally • Should ensure that the cluster is efficiently utilized
  • 34. The Fair Scheduler • The Fair Scheduler promotes fairness between schedulable entities • The Fair Scheduler awards resources to pools that are most underserved – Gives a container to the pool that has the fewest resources allocated
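A sketch of what that looks like in capacity-scheduler.xml, using two made-up queues, prod and dev; after editing the file, the changes are applied at runtime with yarn rmadmin -refreshQueues:

    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>prod,dev</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.prod.capacity</name>
      <value>70</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.dev.capacity</name>
      <value>30</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
      <value>50</value>      <!-- limit elasticity for the dev queue -->
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.dev.state</name>
      <value>STOPPED</value> <!-- drain ('decommission') the dev queue -->
    </property>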
  • 35. Fair Scheduler Pools • Each job is assigned to a pool – Also known as a queue in YARN terminology • All pools in YARN descend from the root pool • Physical resources are not bound to any specific pool • Pools can be predefined or defined dynamically by specifying a pool name when you submit a job • Pools and subpools are defined in the fair-scheduler.xml file [Diagram: Total 30GB – Alice 15GB, Bob 15GB]
  • 36. In Which Pool Will a Job Run? • The default pool for a job is root.username – For example, root.Alice and root.Bob – You can drop root when referring to a pool • For example, you can refer to root.Alice simply as Alice • Jobs can be assigned to arbitrarily-named pools – To specify the pool name when submitting a MapReduce job, use • -D mapreduce.job.queuename
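A minimal sketch of such a submission, assuming the stock MapReduce examples jar and made-up paths and pool name (any job whose driver uses ToolRunner/GenericOptionsParser accepts -D this way):

    hadoop jar hadoop-mapreduce-examples.jar wordcount \
        -D mapreduce.job.queuename=root.production \
        /user/alice/books /user/alice/wordcounts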
  • 37. When Will a Job Run Within a Pool? • The Fair Scheduler grants resources to a pool, but which job's task will get resources? • The policies for assigning resources to jobs within a pool are defined in fair-scheduler.xml • The Fair Scheduler uses three techniques for prioritizing jobs within pools: – Single resource fairness – Dominant resource fairness – FIFO • You can also configure the Fair Scheduler to delay assignment of resources when a preferred rack or node is not available
  • 38. Single Resource Fairness • Single resource fairness – Is the default Fair Scheduler policy – Schedules jobs using memory • Example – Two pools: Alice has 15GB allocated, and Bob has 5GB – Both pools request a 10GB container of memory – Bob has fewer resources and will be granted the next 10GB that becomes available [Diagram: Total 30GB – Alice 15GB, Bob 5GB; the next 10GB goes to Bob]
  • 39. Adding Pools Redistributes Resources • The user Charlie now submits a job to a new pool – Resource allocations are adjusted – Each pool receives a fair share of cluster resources [Diagram: Total 30GB – Alice 10GB, Bob 10GB, Charlie 10GB]
  • 40. Determining the Fair Share • The fair share of resources assigned to the pool is based on – The total resources available across the cluster – The number of pools competing for cluster resources • Excess cluster capacity is spread across all pools – The aim is to maintain the most even allocation possible so every pool receives its fair share of resources • The fair share will never be higher than the actual demand • Pools can use more than their fair share when other pools are not in need of resources – This happens when there are no tasks eligible to run in other pools
  • 41. Minimum Resources • A pool with minimum resources defined receives priority during resource allocation • The minimum resources, minResources, are the minimum amount of resources that must be allocated to the pool prior to fair share allocation – Minimum resources are allocated to each pool assuming there is cluster capacity – Pools that have minimum resources specified will receive priority in resource assignment
  • 42. Minimum Resource Allocation Example • First, fill up the Production pool to the 20GB minimum guarantee • Then distribute the remaining 10GB evenly across Alice and Bob [Diagram: Total 30GB – Production (demand 100GB, minResources 20GB): 20GB; Alice (demand 30GB): 5GB; Bob (demand 25GB): 5GB]
  • 43. Minimum Resource Allocation Example 2: Production Pool Empty • Production has no demand, so no resources are allocated to it • All resources are allocated evenly between Alice and Bob [Diagram: Total 30GB – Production (demand 0GB, minResources 20GB): 0GB; Alice (demand 30GB): 15GB; Bob (demand 25GB): 15GB]
  • 44. Minimum Resource Allocation Example 3: MinResources Exceed Resources • Combined minResources of Production and Research exceed capacity • Minimum resources are assigned proportionally based on defined minResources until available resources are exhausted • No memory remains for pools without minResources defined (i.e., Bob) [Diagram: Total 30GB – Production (demand 100GB, minResources 50GB): 20GB; Research (demand 30GB, minResources 25GB): 10GB; Bob (demand 25GB): 0GB]
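To see where the 20GB/10GB split comes from, the declared minimums (50GB + 25GB = 75GB) are scaled down to the 30GB that actually exists:

    Production: 30GB × 50/75 = 20GB
    Research:   30GB × 25/75 = 10GB
    Bob:        0GB (no minResources defined, and nothing is left over)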
  • 45. Minimum Resource Allocation Example 4: MinResources < Fair Share • Production is filled to minResources • Remaining 25GB is distributed across all pools • Production pool receives more than its minResources, to maintain fairness [Diagram: Total 30GB – Production (demand 100GB, minResources 5GB): 10GB; Alice (demand 30GB): 10GB; Bob (demand 25GB): 10GB]
  • 46. Pools with Weights • Instead of (or in addition to) setting minResources, pools can be assigned a weight • Pools with higher weight receive more resources during allocation • 'Even water glass height' analogy: – Think of the weight as controlling the 'width' of the glass
  • 47. Example: Pool with Double Weight • Production is filled to minResources (5GB) • Remaining 25GB is distributed across all pools • The Bob pool receives twice the amount of memory during fair share allocation [Diagram: Total 30GB – Production (demand 100GB, minResources 5GB): 8GB; Alice (demand 30GB): 8GB; Bob (demand 25GB, weight 2): 14GB]
  • 48. Dominant Resource Fairness • The Fair Scheduler can be configured to schedule with both memory and CPU using dominant resource fairness • Scenario #1: – Alice has 6GB and 3 cores, and Bob has 4GB and 2 cores – which pool receives the next resource allocation? • Bob will receive the next container because it has less memory and fewer CPU cores allocated than Alice [Diagram: Alice usage – 6GB, 3 cores; Bob usage – 4GB, 2 cores]
  • 49. Dominant Resource Fairness Example • Scenario #2: – A cluster has 10GB of total memory and 20 cores – Pool Alice has containers granted for 4GB of memory and 5 cores – Pool Bob has containers granted for 1GB of memory and 10 cores • Alice will receive the next container because its 40% dominant share of memory is less than the Bob pool's 50% dominant share of CPU [Diagram: Alice usage – 4GB (40% of capacity), 5 cores (25% of capacity); Bob usage – 1GB (10% of capacity), 10 cores (50% of capacity)]
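Spelling out the dominant shares behind that decision:

    Alice: memory 4/10 = 40%, CPU 5/20 = 25%  → dominant share = 40% (memory)
    Bob:   memory 1/10 = 10%, CPU 10/20 = 50% → dominant share = 50% (CPU)

Dominant resource fairness grants the next container to the pool with the smaller dominant share, which here is Alice.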
  • 50. Achieving Fair Share: The Patient Approach • If shares are imbalanced, pools which are over their fair share may not assign new tasks when their old ones complete – Those resources then become available to pools which are operating below their fair share • However, waiting patiently for a task in another pool to finish may not be acceptable in a production environment – Tasks could take a long time to complete
  • 51. Achieving Fair Share: The Brute Force Approach • With preemption enabled, the Fair Scheduler actively kills tasks that belong to pools operating over their fair share – Pools operating below fair share receive those reaped resources • There are two types of preemption available – Minimum share preemption – Fair share preemption • The preemption code avoids killing a task in a pool if it would cause that pool to begin preempting tasks in other pools – This prevents a potentially endless cycle of pools killing one another's tasks
  • 52. Minimum Share Preemption • Pools with a minResources configured are operating on an SLA (Service Level Agreement) • Pools that are below their minimum share as defined by minResources can preempt tasks in other pools – Set minSharePreemptionTimeout to the number of seconds the pool is under its minimum share before preemption should begin – Default is infinite (Java's Long.MAX_VALUE)
  • 53. Fair Share Preemption • Pools not receiving their fair share can preempt tasks in other pools – Only pools that exceed their fair share are candidates for preemption • Use fair share preemption conservatively – Set fairSharePreemptionTimeout to the number of seconds a pool is under fair share before preemption should begin – Default is infinite (Java's Long.MAX_VALUE)
  • 54. Agenda •  Why  YARN?   •  YARN  Architecture  and  Concepts   •  Resources  &  Scheduling   –  Capacity  Scheduler   –  Fair  Scheduler   •  Configuring  the  Fair  Scheduler   •  Managing  Running  Jobs  
  • 55. Configuring Fair Scheduler Capabilities (1) • yarn.scheduler.fair.allow-undeclared-pools (yarn-site.xml) – When true, new pools can be created at application submission time or by the user-as-default-queue property. When false, submitting to a pool that is not specified in the fair-scheduler.xml file causes the application to be placed in the "default" pool. Default: true. Ignored if a pool placement policy is defined in the fair-scheduler.xml file. • yarn.scheduler.fair.preemption (yarn-site.xml) – Enables preemption in the Fair Scheduler. Set to true if you have pools that must operate on an SLA. Default: false. • yarn.scheduler.fair.user-as-default-queue (yarn-site.xml) – Send jobs to pools based on users' names instead of to the default pool, root.default. Default: true
  • 56. Configuring Fair Scheduler Capabilities (2) • yarn.scheduler.fair.locality.threshold.node and yarn.scheduler.fair.locality.threshold.rack (yarn-site.xml) – For applications that request containers on particular nodes or racks, the number of scheduling opportunities since the last container assignment to wait before accepting a placement on another node. Expressed as a float between 0 and 1 which, as a fraction of the cluster size, is the number of scheduling opportunities to pass up. Default: 1 (don't pass up any scheduling opportunities) • Example: yarn.scheduler.fair.locality.threshold.node = 0.02, cluster size = 100 nodes. At most 2 scheduling opportunities can be skipped when the preferred placement cannot be met.
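A yarn-site.xml sketch pulling the properties from these two slides together; the values shown are illustrative choices, not recommendations from the original deck:

    <property>
      <name>yarn.scheduler.fair.allow-undeclared-pools</name>
      <value>false</value>  <!-- only pools declared in fair-scheduler.xml; anything else lands in root.default -->
    </property>
    <property>
      <name>yarn.scheduler.fair.preemption</name>
      <value>true</value>   <!-- needed if any pool carries an SLA via minResources -->
    </property>
    <property>
      <name>yarn.scheduler.fair.user-as-default-queue</name>
      <value>true</value>   <!-- jobs default to a pool named after the submitting user -->
    </property>
    <property>
      <name>yarn.scheduler.fair.locality.threshold.node</name>
      <value>0.02</value>   <!-- on a 100-node cluster, skip at most 2 scheduling opportunities -->
    </property>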
  • 57. Configuring Resource Allocation for Pools and Users (1) • You configure Fair Scheduler pools in the /etc/hadoop/conf/fair-scheduler.xml file • The Fair Scheduler rereads this file every 10 seconds – A ResourceManager restart is not required when the file changes • The fair-scheduler.xml file must contain an <allocations> element • Use the <queue> element to configure resource allocation for a pool • Use the <user> element to configure resource allocation for a user across multiple pools
  • 58. Configuring Resource Allocation for Pools and Users (2) • To specify resource allocations, use the <queue> or <user> element with any or all of the following subelements – <minResources> • The minimum resources to which the pool is entitled • Format is x mb, y vcores • Example: 10000 mb, 5 vcores – <maxResources> • The maximum resources to which the pool is entitled • Format is x mb, y vcores
  • 59. Configuring Resource Allocation for Pools and Users (3) • Additional sub-elements of <queue> or <user> to use when specifying resource allocations – <maxRunningApps> • The maximum number of applications in the pool that can run concurrently – <weight> • Used for non-proportional sharing with other pools • The default is 1 – <minSharePreemptionTimeout> • Time to wait before preempting tasks – <schedulingPolicy> • SRF for single resource fairness (the default) • DRF for dominant resource fairness • FIFO for first-in, first-out
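Putting those sub-elements together, a sketch of a complete allocation file; the pool and user names are invented for illustration, and the concrete examples from the original slides follow in the next entries:

    <?xml version="1.0"?>
    <allocations>
      <queue name="production">
        <minResources>20000 mb, 10 vcores</minResources>
        <maxResources>60000 mb, 30 vcores</maxResources>
        <maxRunningApps>20</maxRunningApps>
        <weight>2.0</weight>
        <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
      </queue>
      <user name="alice">
        <maxRunningApps>5</maxRunningApps>
      </user>
      <userMaxAppsDefault>3</userMaxAppsDefault>
    </allocations>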
  • 60. fair-scheduler.xml Example (1) • Allow users to run three jobs, but allow Bob to run six jobs
    <?xml version="1.0"?>
    <allocations>
      <userMaxAppsDefault>3</userMaxAppsDefault>
      <user name="bob">
        <maxRunningApps>6</maxRunningApps>
      </user>
    </allocations>
  • 61. fair-scheduler.xml Example (2) • Add a fair share timeout
    <?xml version="1.0"?>
    <allocations>
      <userMaxAppsDefault>3</userMaxAppsDefault>
      <user name="bob">
        <maxRunningApps>6</maxRunningApps>
      </user>
      <fairSharePreemptionTimeout>300</fairSharePreemptionTimeout>
    </allocations>
  • 62. fair-scheduler.xml Example (3) • Define the production pool with a weight of 2 and a resource allocation of 10000 MB and 1 core
    <?xml version="1.0"?>
    <allocations>
      <userMaxAppsDefault>3</userMaxAppsDefault>
      <queue name="production">
        <minResources>10000 mb, 1 vcores</minResources>
        <weight>2.0</weight>
      </queue>
    </allocations>
  • 63. fair-scheduler.xml Example (4) • Add an SLA to the production pool
    <?xml version="1.0"?>
    <allocations>
      <userMaxAppsDefault>3</userMaxAppsDefault>
      <queue name="production">
        <minResources>10000 mb, 1 vcores</minResources>
        <weight>2.0</weight>
        <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
      </queue>
    </allocations>
  • 64. The Fair Scheduler User Interface • http://<resource_manager_host>:8088/cluster/scheduler
  • 65. Agenda •  Why  YARN?   •  YARN  Architecture  and  Concepts   •  Resources  &  Scheduling   –  Capacity  Scheduler   –  Fair  Scheduler   •  Configuring  the  Fair  Scheduler   •  Managing  Running  Jobs  
  • 66. Displaying Jobs • To view jobs currently running on the cluster – yarn application -list – Lists all running jobs, including the application ID for each • To view all jobs on the cluster, including completed jobs – yarn application -list -appStates ALL • To display the status of an individual job – yarn application -status <application_ID> • You can also use the ResourceManager Web UI, Hue, Ambari, or Cloudera Manager to display jobs
  • 67. Killing Jobs • It is important to note that once a user has submitted a job, they cannot stop it just by hitting Ctrl-C in their terminal – That only stops job output from appearing on the user's console – The job is still running on the cluster! • To kill a job running on the cluster – yarn application -kill <application_ID> • You can also kill a job from Cloudera Manager
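A short session tying the job-management commands together; the application ID is a made-up placeholder in the standard application_<clusterTimestamp>_<sequence> format:

    $ yarn application -list
    $ yarn application -status application_1417650828052_0003
    $ yarn application -kill   application_1417650828052_0003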
  • 68.