Impala Resource Management - OUTDATED

© Cloudera, Inc. All rights reserved.

Impala Resource Management: A Brief Overview

Matthew Jacobs | @mattjacobs

November 2015

Relevant through Impala 2.2/CDH5.4

Impala Resource Management: Overview

•  Problem: how to best utilize cluster resources

State of the world as of Impala 2.2/CDH5.4

•  Within Impala
   • READY FOR USE: Built-in Admission Control (introduced in Impala 1.3/CDH 5.0)
•  Between Impala and the rest of the world
   • READY FOR USE: “Static Partitioning” from Cloudera Manager
   • NOT READY: Integration with YARN
      •  Experimental integration shipped in Impala 1.3/CDH 5.0
      •  Some known issues exist, do not use it today! More on this later…
      •  We’re actively working on this, stay tuned!

Talk Overview

This is a very brief overview! Many details we can’t cover in 20min :(

•  How to be successful today (including with Impala 2.3/CDH5.5)
•  Overview of Impala on YARN
   •  Architecture
   •  Why you can’t use it yet
   •  How it might look when you can

“Resource Management” Today

•  Use one or both of:
   • Static Partitioning with Cloudera Manager (also called “Static Service Pools”)
   • Impala’s built-in Admission Control
•  Static Partitioning: dedicate resources for Impala, HBase, YARN, etc.
   • Easy to use and works well. Set up by Cloudera Manager, uses cgroups
   • E.g. Impala gets 100GB/30% CPU, HBase gets 50GB/20% CPU, etc.
•  Admission Control: throttle Impala queries
   • Set a limit on the max # queries or max memory used by those queries
   • E.g. queue queries once more than 20 queries are running concurrently, or queue once more than 100GB is used
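
The admission rule on this slide can be sketched as a toy model. This is not Impala's implementation: the `Pool` class, its field names, and the `submit` method are hypothetical, and the sketch assumes a single pool that tracks a running-query count and an aggregate memory figure.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Toy model of one admission-control pool (all names are illustrative)."""
    max_queries: int            # e.g. 20 concurrent queries
    max_mem_gb: float           # e.g. 100 GB used by admitted queries
    running: int = 0
    mem_in_use_gb: float = 0.0
    queued: deque = field(default_factory=deque)

    def submit(self, query_id: str, mem_gb: float) -> str:
        """Admit the query if both limits still hold, otherwise queue it."""
        fits_count = self.running + 1 <= self.max_queries
        fits_mem = self.mem_in_use_gb + mem_gb <= self.max_mem_gb
        if fits_count and fits_mem:
            self.running += 1
            self.mem_in_use_gb += mem_gb
            return "admitted"
        # Over a limit: the query waits instead of running.
        self.queued.append((query_id, mem_gb))
        return "queued"
```

With `Pool(max_queries=20, max_mem_gb=100)`, the 21st concurrent query is queued, as is any query that would push the pool past 100GB.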

When to Use AC? Static Partitioning?

With Static Partitioning, With Admission Control
•  Using Impala with other systems (e.g. Hive, Spark) and need to guarantee each get resources
•  Heavy Impala workload, need to make sure queries aren’t stepping on each other

With Static Partitioning, Without Admission Control
•  Using Impala with other systems and need to guarantee each get resources
•  Light to moderate Impala workload, not using all available resources yet

Without Static Partitioning, With Admission Control
•  Impala only cluster, or other systems have very light, non-competing workloads
•  Heavy Impala workload, need to make sure queries aren’t stepping on each other

Without Static Partitioning, Without Admission Control
•  Enough cluster resources are available for all workloads to consume as much as necessary

(Aside: A Plethora of Mem Limits)

•  Process (impalad) memory limit
   •  Max memory the process can use across all queries. When a query consumes memory such that the process hits this limit the query is killed
   •  Set with the “--mem_limit” impalad command-line argument, or “Impala Daemon Memory Limit” in CM. The value is specified in terms of single-impalad memory.
•  Pool (admission control) memory limit
   •  Max memory the queries in a pool/queue can use. The value is used only to admit queries, not enforced once queries are admitted. The value is specified as the cluster-wide limit, i.e. aggregate limit across all impalads.
   •  http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_admission.html
•  Query (query option) memory limit
   •  Max memory a query can use; if a query uses more than it may have to be killed (if it can’t spill).
   •  Set via the “set mem_limit=Xg” query option. Can set a default query option via impalad command-line arguments (see the next slide).
   •  The value is specified in terms of single-impalad memory, e.g. Xg per node
   •  http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_mem_limit.html
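
The limits above use different units: the query limit is per impalad, while the pool limit is aggregate across all impalads. A rough sketch of comparing the two follows. It assumes that a query runs on every node and that its cluster-wide footprint is its per-node limit multiplied by the node count; that multiplication is an assumption for illustration, and both function names are hypothetical.

```python
def cluster_wide_gb(per_node_limit_gb: float, num_nodes: int) -> float:
    """Aggregate memory a query may use if every node reaches its per-node limit."""
    return per_node_limit_gb * num_nodes


def fits_in_pool(per_node_limit_gb: float, num_nodes: int,
                 pool_in_use_gb: float, pool_limit_gb: float) -> bool:
    """Compare like with like: convert the per-node figure before checking the pool."""
    needed = cluster_wide_gb(per_node_limit_gb, num_nodes)
    return pool_in_use_gb + needed <= pool_limit_gb
```

Under these assumptions, a query with a 5g per-node limit on a 20-node cluster counts as 100GB against the pool, so a 100GB pool can admit only one such query at a time.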

Important! AC with Mem Limits is Tricky

•  Admission based on pool memory limits will use:
   • the query memory limit if it is set (set MEM_LIMIT=Xg;)
   • Otherwise it falls back to an estimate from planning, which is usually wrong!
•  Do not use pool memory limits unless you also set query memory limits
   • Consider setting a default value for the ‘mem_limit’ query option
   • Set via the ‘--default_query_options’ impalad argument
   • E.g. --default_query_options='mem_limit=5g'
   • Can still override the default with the ‘set mem_limit=X;’ query option.
•  Picking a good memory limit is hard, use CM’s charts to help understand your workload
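
The fallback rule on this slide can be written down as a small sketch. The function names are hypothetical and the code only models the choice described here (explicit query limit first, planner estimate otherwise); the flag string it builds is the one shown on the slide.

```python
from typing import Optional


def memory_for_admission(query_mem_limit_gb: Optional[float],
                         planner_estimate_gb: float) -> float:
    """Pick the per-query figure that admission compares against the pool limit."""
    if query_mem_limit_gb is not None:
        # An explicit mem_limit (per query or via the default) takes precedence.
        return query_mem_limit_gb
    # No limit set: only the planner's estimate is available, which is often off.
    return planner_estimate_gb


def default_query_options_flag(mem_limit: str) -> str:
    """Build the impalad argument that sets a default mem_limit for all queries."""
    return f"--default_query_options='mem_limit={mem_limit}'"
```

Setting a default means the first branch is always taken, so admission never depends on the planner estimate.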

“Resource Management” Today, Summary

•  Today: Use Admission Control and Static Partitioning
•  We skipped over a lot of details, see the docs for more information
   • Impala Admission Control:
     http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_admission.html
   • “Static Partitioning” in Cloudera Manager (also called “Static Service Pools”):
     http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_service_pools.html
•  Ask us questions on impala-user@cloudera.org

Impala on YARN

•  YARN is a “resource negotiator” that helps share cluster resources within Hadoop
•  Works well for MapReduce and similar batch-oriented processing engines
•  Doesn’t work well for services/frameworks that need:
   •  Long running processes
   •  Gang scheduling
   •  Very low-latency scheduling requirements
•  Doesn’t work so well for Impala
   • (And also HBase, MPI, Presto, custom apps, etc.)

Llama to the Rescue

•  Llama = Long Lived Application MAster
•  On github: http://cloudera.github.io/llama/index.html
•  An interface between:
   • YARN’s ApplicationMaster (AM) model (batch jobs where tasks are each a process, coordinated by an AM)
   • Impala’s low-latency, in-process query model
•  Llama provides:
   • Gang-scheduling
   • “Container” caching (to reduce resource acquisition cost)

How Llama fits in

(Diagram not preserved in this transcript.)

Gang scheduling

•  YARN returns resources in a trickle, as they become available
•  For MR this is perfect, as tasks are mostly independent (and checkpoint to disk)
•  For low-latency queries, we require all resources to be available at once so that query tasks can stream results to one another
•  Llama buffers resources between YARN and Impala to make resource requests appear atomic and indivisible
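
The buffering idea can be illustrated with a minimal sketch. It is not Llama's code: `GangBuffer` and `on_grant` are hypothetical names, and grants are modelled as plain node names arriving one at a time.

```python
class GangBuffer:
    """Holds grants that trickle in until the whole request is satisfied."""

    def __init__(self, nodes_needed):
        self.pending = set(nodes_needed)   # nodes still waiting for a grant
        self.granted = []                  # grants received so far, in arrival order

    def on_grant(self, node):
        """Record one grant; return the full gang once complete, else None."""
        if node in self.pending:
            self.pending.remove(node)
            self.granted.append(node)
        if not self.pending:
            # Only now does the caller see anything: all resources at once.
            return list(self.granted)
        return None
```

The caller sees nothing until the last grant arrives, so from its side the request is all-or-nothing even though the grants arrived one by one.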

Resource caching

• Every container requires YARN to make an expensive resource allocation decision
• We ask Llama to cache resources between requests
• Containers stay in their queue in Llama, until YARN forcefully reclaims them
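
A minimal sketch of the caching behaviour described above, under stated assumptions: `ContainerCache` and its methods are hypothetical, and the `allocate` callable is a stand-in for an expensive YARN allocation rather than a real YARN API.

```python
class ContainerCache:
    """Reuses released containers instead of asking the scheduler again."""

    def __init__(self, allocate):
        self._allocate = allocate   # callable(node) -> container; stand-in for a YARN request
        self._idle = {}             # node -> list of idle, cached containers

    def acquire(self, node):
        """Return a cached container for the node, allocating only on a miss."""
        idle = self._idle.get(node)
        if idle:
            return idle.pop()
        return self._allocate(node)

    def release(self, node, container):
        """Keep the container cached between requests rather than returning it."""
        self._idle.setdefault(node, []).append(container)

    def reclaim(self, node):
        """Model the scheduler forcefully taking back the idle containers on a node."""
        return self._idle.pop(node, [])
```

A second `acquire` on the same node after a `release` skips the allocation call entirely, which is the cost the cache is meant to avoid.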

Impala on YARN: Current Status

•  Experimental integration was shipped in Impala 1.3 / CDH 5.0
•  Not ready for use yet!
•  A number of known bugs, see umbrella JIRA IMPALA-2370 to track
•  Some (but not all) important fixes in upcoming Impala 2.3 / CDH 5.5 release
•  Ongoing scale and performance testing work needed to provide guidance
•  In a future release (post-Impala 2.3), we will be able to recommend usage for some workloads, w/ guidance
