What's New and Upcoming in HDFS - the Hadoop Distributed File System
- 1. What's new and upcoming in HDFS
January 30, 2013
Todd Lipcon, Software Engineer
todd@cloudera.com
@tlipcon
- 2. Introductions
• Software engineer on Cloudera's Storage Engineering team
• Committer and PMC Member for Apache Hadoop and Apache HBase
• Projects in 2012
• Responsible for >50% of the code for all phases of HA development
• Also worked on many performance and stability improvements
• This presentation is highly technical – please feel free to grab/email me later if you'd like to clarify anything!
- 3. Outline
• HDFS 2.0 – what's new in 2012?
• HA Phase 1 (Q1 2012)
• HA Phase 2 (Q2-Q4 2012)
• Performance improvements and other new features
• What's coming in 2013?
• HDFS Snapshots
• Better storage density and file formats
• Caching and Hierarchical Storage Management
- 5. HDFS HA Background
• HDFS's strength is its simple and robust design
• Single master NameNode maintains all metadata
• Scales to multi-petabyte clusters easily on modern hardware
• Traditionally, the single master was also a single point of failure
• Generally good availability, but not ops-friendly
• No hot patch ability, no hot reconfiguration
• No hot hardware replacement
• Hadoop is now mission critical: SPOF not OK!
- 6. HDFS HA Development Phase 1
• Completed March 2012 (HDFS-1623)
• Introduced the StandbyNode, a hot backup for the HDFS NameNode
• Relied on shared storage to synchronize namespace state
• (e.g. a NAS filer appliance)
• Allowed operators to manually trigger failover to the Standby
• Sufficient for many HA use cases: avoided planned downtime for hardware and software upgrades, planned machine/OS maintenance, configuration changes, etc.
- 7. HDFS HA Architecture Phase 1
• Parallel block reports sent to Active and Standby NameNodes
• NameNode state shared by locating the edit log on a NAS over NFS
• Active NameNode writes while the Standby Node "tails"
• Client failover done via client configuration
• Each client configured with the address of both NNs: try both to find the active (see the configuration sketch below)
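A minimal sketch of that client-side configuration, shown via the Java Configuration API rather than hdfs-site.xml; the nameservice ID "mycluster" and the nn1/nn2 hostnames are hypothetical, while the property names and the ConfiguredFailoverProxyProvider class are the standard HDFS HA client settings:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;

  public class HaClientConfigSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Logical name for the NameNode pair; clients address the nameservice, not a host
      conf.set("dfs.nameservices", "mycluster");
      conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
      conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
      conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
      // The proxy provider tries each NameNode until it finds the active one
      conf.set("dfs.client.failover.proxy.provider.mycluster",
          "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
      FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
      System.out.println("Connected to " + fs.getUri());
    }
  }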
- 9. Fencing and NFS
• Must avoid split-brain syndrome
• Both nodes think they are active and try to write to the same edit log. Your metadata becomes corrupt and requires manual intervention to restart
• Configure a fencing script (see the sketch below)
• Script must ensure that the prior active has stopped writing
• STONITH: shoot-the-other-node-in-the-head
• Storage fencing: e.g. using the NetApp ONTAP API to restrict filer access
• Fencing script must succeed to have a successful failover
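A sketch of what a fencing configuration looks like, again via the Java Configuration API; the script path and SSH key location are hypothetical, while dfs.ha.fencing.methods and the built-in sshfence/shell(...) fencers are the standard mechanism:

  import org.apache.hadoop.conf.Configuration;

  public class FencingConfigSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Fencers are tried in order until one succeeds; failover is aborted if all fail.
      // sshfence logs into the previous active and kills the NameNode process;
      // shell(...) runs an arbitrary script (e.g. a filer or PDU fencer).
      conf.set("dfs.ha.fencing.methods",
          "sshfence\nshell(/path/to/my-fencing-script.sh)");
      conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
      System.out.println(conf.get("dfs.ha.fencing.methods"));
    }
  }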
- 10. Shortcomings of Phase 1
• Insufficient to protect against unplanned downtime
• Manual failover only: requires an operator to step in quickly after a crash
• Various studies indicated this was the minority of downtime, but still important to address
• Requirement of a NAS device made deployment complex, expensive, and error-prone (we always knew this was just the first phase!)
- 11. HDFS HA Development Phase 2
• Multiple new features for high availability
• Automatic failover, based on Apache ZooKeeper
• Remove dependency on NAS (network-attached storage)
• Address new HA use cases
• Avoid unplanned downtime due to software or hardware faults
• Deploy in filer-less environments
• Completely stand-alone HA with no external hardware or software dependencies
• no Linux-HA, filers, etc.
- 13. Automatic Failover Goals
• Automatically detect failure of the Active NameNode
• Hardware, software, network, etc.
• Do not require operator intervention to initiate failover
• Once failure is detected, the process completes automatically
• Support manually initiated failover as first-class
• Operators can still trigger failover without having to stop the Active
• Do not introduce a new SPOF
• All parts of an auto-failover deployment must themselves be HA
- 14. Automatic Failover Architecture
• Automatic failover requires ZooKeeper
• Not required for manual failover
• ZK makes it easy to:
• Detect failure of the Active NameNode
• Determine which NameNode should become the Active NN
- 15. Automatic Failover Architecture
• New daemon: ZooKeeper Failover Controller (ZKFC)
• In an auto-failover deployment, run two ZKFCs (enabling this is sketched below)
• One per NameNode, on that NameNode machine
• ZKFC has three simple responsibilities:
• Monitors health of the associated NameNode
• Participates in leader election of NameNodes
• Fences the other NameNode if it wins the election
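A sketch of the two settings that enable this, assuming a hypothetical three-node ZooKeeper ensemble; the property names are the standard ones for HDFS automatic failover:

  import org.apache.hadoop.conf.Configuration;

  public class AutoFailoverConfigSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // hdfs-site.xml: let the ZKFCs drive failover instead of an operator
      conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
      // core-site.xml: ZooKeeper ensemble used for failure detection and leader election
      conf.set("ha.zookeeper.quorum",
          "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
      System.out.println(conf.get("ha.zookeeper.quorum"));
    }
  }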
- 18. Shared Storage in HDFS HA
• The Standby NameNode synchronizes the namespace by following the Active NameNode's transaction log
• Each operation (e.g. mkdir(/foo)) is written to the log by the Active
• The StandbyNode periodically reads all new edits and applies them to its own metadata structures
• Reliable shared storage is required for correct operation
• In phase 1, shared storage was synonymous with NFS-mounted NAS
- 19. Shortcomings of the NFS-based approach
• Custom hardware
• Lots of our customers don't have SAN/NAS available in their datacenters
• Costs money, time and expertise
• Extra "stuff" to monitor outside HDFS
• We just moved the SPOF, didn't eliminate it!
• Complicated
• Storage fencing, NFS mount options, multipath networking, etc.
• Organizationally complicated: dependencies on the storage ops team
• NFS issues
• Buggy client implementations, little control over timeout behavior, etc.
- 20. Primary Requirements for Improved Storage
• No special hardware (PDUs, NAS)
• No custom fencing configuration
• Too complicated == too easy to misconfigure
• No SPOFs
• punting to filers isn't a good option
• need something inherently distributed
- 21. Secondary Requirements
• Configurable degree of fault tolerance
• Configure N nodes to tolerate (N-1)/2 failures
• Making N bigger (within reasonable bounds) shouldn't hurt performance. Implies:
• Writes done in parallel, not pipelined
• Writes should not wait on the slowest replica
• Locate replicas on existing hardware investment (e.g. share with JobTracker, NN, SBN)
- 22. Operational Requirements
• Should be operable by existing Hadoop admins. Implies:
• Same metrics system ("hadoop metrics")
• Same configuration system (XML)
• Same logging infrastructure (log4j)
• Same security system (Kerberos-based)
• Allow existing ops to easily deploy and manage the new feature
• Allow existing Hadoop tools to monitor the feature
• (e.g. Cloudera Manager, Ganglia, etc.)
- 23. Our solution: QuorumJournalManager
• QuorumJournalManager (client)
• Plugs into the JournalManager abstraction in the NN (instead of the existing FileJournalManager)
• Provides the edit log storage abstraction
• JournalNode (server)
• Standalone daemon running on an odd number of nodes (see the configuration sketch below)
• Provides actual storage of edit logs on local disks
• Could run inside other daemons in the future
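A sketch of how the NameNode is pointed at a JournalNode quorum instead of an NFS directory; the hostnames and local path are hypothetical, 8485 is the default JournalNode RPC port, and the journal ID after the slash conventionally matches the nameservice name:

  import org.apache.hadoop.conf.Configuration;

  public class QjmConfigSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // NameNode side: write shared edits to a quorum of JournalNodes
      conf.set("dfs.namenode.shared.edits.dir",
          "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
      // JournalNode side: where each JN keeps its copy of the edit log on local disk
      conf.set("dfs.journalnode.edits.dir", "/data/1/hdfs/journal");
      System.out.println(conf.get("dfs.namenode.shared.edits.dir"));
    }
  }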
- 24. Architecture
- 25. Commit protocol
• NameNode accumulates edits locally as they are logged
• On logSync(), sends the accumulated batch to all JNs via Hadoop RPC
• Waits for a success ACK from a majority of nodes (sketched below)
• Majority commit means that a single lagging or crashed replica does not impact NN latency
• Latency @ NN = median(Latency @ JNs)
• Uses the well-known Paxos algorithm to perform recovery of any in-flight edits on leader switchover
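An illustrative sketch of the majority-ACK idea (hypothetical names, not the actual QJM code): the batch is sent to every JournalNode in parallel and logSync() returns as soon as a majority respond, so latency tracks the median JournalNode rather than the slowest one:

  import java.util.List;
  import java.util.concurrent.CountDownLatch;
  import java.util.concurrent.ExecutorService;

  public class QuorumCommitSketch {
    // Send one batch of edits to all JournalNodes and wait for a majority of ACKs.
    static void logSync(List<ExecutorService> journalNodeRpcPools, byte[] editBatch)
        throws InterruptedException {
      int majority = journalNodeRpcPools.size() / 2 + 1;
      CountDownLatch acks = new CountDownLatch(majority);
      for (ExecutorService jn : journalNodeRpcPools) {
        jn.submit(() -> {
          sendEditsOverRpc(editBatch);  // hypothetical RPC call to one JournalNode
          acks.countDown();             // count the successful ACK
        });
      }
      acks.await();  // returns once a majority have ACKed; stragglers finish in the background
    }

    static void sendEditsOverRpc(byte[] batch) { /* placeholder for the real RPC */ }
  }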
- 26. JN Fencing
• How do we prevent split-brain?
• Each instance of QJM is assigned a unique epoch number
• provides a strong ordering between client NNs
• Each IPC contains the client's epoch
• JN remembers on disk the highest epoch it has seen
• Any request from an earlier epoch is rejected. Any from a newer one is recorded on disk (sketched below)
• Distributed Systems folks may recognize this technique from Paxos and other literature
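An illustrative sketch of that epoch check as a JournalNode might apply it (hypothetical names, not the actual JournalNode code); the real implementation persists the highest promised epoch to disk before acknowledging a newer writer:

  public class EpochFencingSketch {
    private long highestPromisedEpoch;  // durably recorded on local disk in the real design

    // Called for every IPC; the request carries the writer's epoch number.
    synchronized void checkEpoch(long requestEpoch) {
      if (requestEpoch < highestPromisedEpoch) {
        // An older, fenced-out writer: reject its request outright
        throw new IllegalStateException("Rejecting request with stale epoch "
            + requestEpoch + " < " + highestPromisedEpoch);
      }
      if (requestEpoch > highestPromisedEpoch) {
        highestPromisedEpoch = requestEpoch;  // remember the new writer before serving it
      }
    }
  }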
- 27. Fencing with epochs
• Fencing is now implicit
• The act of becoming active causes any earlier active NN to be fenced out
• Since a quorum of nodes has accepted the new active, any other IPC by an earlier epoch number can't get a quorum
• Eliminates confusing and error-prone custom fencing configuration
- 28. Other implementation features
• Hadoop Metrics
• lag, percentile latencies, etc. from the perspective of JN, NN
• metrics for queued txns, % of time each JN fell behind, etc., to help suss out a slow JN before it causes problems
• Security
• full Kerberos and SSL support: edits can be optionally encrypted in-flight, and all access is mutually authenticated
- 30. Testing
• Randomized fault test
• Runs all communications in a single thread with deterministic order and fault injections based on a seed
• Caught a number of really subtle bugs along the way
• Run as an MR job: 5000 fault tests in parallel
• Multiple CPU-years of stress testing: found 2 bugs in Jetty!
• Cluster testing: 100-node, MR, HBase, Hive, etc.
• Commit latency in practice: within the same range as local disks (better than one of two local disks, worse than the other one)
- 31. Deployment
• Most customers running 3 JNs (tolerate 1 failure)
• 1 on NN, 1 on SBN, 1 on JobTracker/ResourceManager
• Optionally run 2 more (e.g. on bastion/gateway nodes) to tolerate 2 failures
• No new hardware investment
• Refer to the docs for detailed configuration info
- 32. Status
• Merged into Hadoop development trunk in early October
• Available in CDH4.1, will be in the upcoming Hadoop 2.1
• Deployed at several customer/community sites with good success so far (no lost data)
• In contrast, we've had several issues with misconfigured NFS filers causing downtime
• Highly recommend you use Quorum Journaling instead of NFS!
- 33. Summary of HA Improvements
• Run an active NameNode and a hot Standby NameNode
• Automatically triggers seamless failover using Apache ZooKeeper
• Stores shared metadata on QuorumJournalManager: a fully distributed, redundant, low-latency journaling system
• All improvements available now in HDFS branch-2 and CDH4.1
- 35. Performance Improvements (overview)
• Several improvements made for Impala
• Much faster libhdfs
• APIs for spindle-based scheduling
• Other more general improvements (especially for HBase and Accumulo)
• Ability to read directly from block files in secure environments
• Ability for applications to perform their own checksums and eliminate IOPS
- 36. libhdfs "direct read" support (HDFS-2834)
• This can also benefit apps like HBase, Accumulo, and MR with a bit more work (TBD in 2013)
- 37. Disk locations API (HDFS-3672)
• HDFS has always exposed node locality information
• Map<Block, List<Datanode Addresses>>
• Now it can also expose disk locality information (see the sketch below)
• Map<Replica, List<Spindle Identifiers>>
• Impala uses this API to keep all disks spinning at full throughput
• ~2x improvement on IO-bound workloads on 12-spindle machines
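A sketch of querying that spindle information, assuming the experimental API added by HDFS-3672 (DistributedFileSystem#getFileBlockStorageLocations returning BlockStorageLocations with VolumeIds); the exact class and method names are my reading of that experimental interface and may differ, and it also requires dfs.datanode.hdfs-blocks-metadata.enabled on the DataNodes:

  import java.util.Arrays;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.BlockStorageLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.VolumeId;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class DiskLocationsSketch {
    public static void main(String[] args) throws Exception {
      DistributedFileSystem dfs =
          (DistributedFileSystem) FileSystem.get(new Configuration());
      Path file = new Path(args[0]);
      FileStatus stat = dfs.getFileStatus(file);
      BlockLocation[] blocks = dfs.getFileBlockLocations(stat, 0, stat.getLen());
      // Ask the DataNodes which disk (volume) each replica lives on
      BlockStorageLocation[] volumeLocations =
          dfs.getFileBlockStorageLocations(Arrays.asList(blocks));
      for (BlockStorageLocation loc : volumeLocations) {
        for (VolumeId volume : loc.getVolumeIds()) {
          System.out.println(loc + " -> volume " + volume);
        }
      }
    }
  }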
- 38. Short-circuit reads
• "Short circuit" allows HDFS clients to open HDFS block files directly from the local filesystem
• Avoids context switches and trips back and forth from user space to kernel space memory, the TCP stack, etc.
• Uses 50% less CPU, avoids significant latency when reading data from the Linux buffer cache
• Sequential IO performance: 2x improvement
• Random IO performance: 3.5x improvement
• This has existed for a while in insecure setups only!
• Clients need read access to all block files :(
- 39. Secure short-circuit reads (HDFS-347)
• DataNode continues to arbitrate access to block files
• Opens input streams and passes them to the DFS client after authentication and authorization checks
• Uses a trick involving Unix Domain Sockets (sendmsg with SCM_RIGHTS)
• Now perf-sensitive apps like HBase, Accumulo, and Impala can safely configure this feature in all environments (configuration sketch below)
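A sketch of enabling the secure short-circuit path; the socket path shown is a hypothetical example, while dfs.client.read.shortcircuit and dfs.domain.socket.path are the standard properties (set on both the DataNodes and the clients):

  import org.apache.hadoop.conf.Configuration;

  public class ShortCircuitConfigSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Let clients read block files directly instead of streaming them over TCP
      conf.setBoolean("dfs.client.read.shortcircuit", true);
      // Unix domain socket the DataNode uses to pass open file descriptors to clients
      conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");
      System.out.println(conf.get("dfs.domain.socket.path"));
    }
  }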
- 40. Checksum skipping (HDFS-3429)
• Problem: HDFS stores block data and block checksums in separate files
• A truly random read incurs two seeks instead of one!
• Solution: HBase now stores its own checksums on its own internal 64KB blocks
• But it turns out that prior versions of HDFS still read the checksum file, even if the client flipped verification off (see the sketch below)
• Fixing this yielded a 40% reduction in IOPS and latency for a multi-TB uniform random-read workload!
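A sketch of the client-side switch involved, using the public FileSystem API; an application that does its own checksumming (as HBase does) asks HDFS to skip verification, and with HDFS-3429 that also avoids touching the separate checksum file:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class OwnChecksumReadSketch {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // The application takes responsibility for data integrity itself,
      // so HDFS no longer needs the .meta checksum file on each random read
      fs.setVerifyChecksum(false);
      try (FSDataInputStream in = fs.open(new Path(args[0]))) {
        byte[] buffer = new byte[64 * 1024];
        in.readFully(0, buffer);  // positioned read of one 64KB application-level block
      }
    }
  }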
- 41. Still more to come?
• Not a ton left on the read path
• Write path still has some low-hanging fruit – hang tight for next year
• Reality check (multi-threaded random-read)
• Hadoop 1.0: 264MB/sec
• Hadoop 2.x: 1393MB/sec
• We've come a long way (5x) in a few years!
- 43. On-the-wire Encryption
• Strong encryption now supported for all traffic on the wire (configuration sketch below)
• both data and RPC
• Configurable cipher (e.g. RC5, DES, 3DES)
• Developed specifically based on requirements from the IC
• Reviewed by some experts here today (thanks!)
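A sketch of the switches involved; the property names are the standard ones (RPC protection in core-site.xml, data transfer encryption in hdfs-site.xml) and the cipher value is just an example:

  import org.apache.hadoop.conf.Configuration;

  public class WireEncryptionConfigSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Encrypt Hadoop RPC (SASL quality of protection "privacy")
      conf.set("hadoop.rpc.protection", "privacy");
      // Encrypt the DataNode block-transfer protocol as well
      conf.setBoolean("dfs.encrypt.data.transfer", true);
      // Optionally choose the cipher used for data transfer
      conf.set("dfs.encrypt.data.transfer.algorithm", "3des");
      System.out.println(conf.get("hadoop.rpc.protection"));
    }
  }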
- 44. Rolling Upgrades and Wire Compatibility
• RPC and Data Transfer now use Protocol Buffers
• Easy for developers to add new features without breaking compatibility
• Allows zero-downtime upgrades between minor releases
• Planning to lock down client-server compatibility even for more major releases in 2013
- 46. HDFS Snapshots
• Full support for efficient subtree snapshots
• Point-in-time "copy" of a part of the filesystem
• Like a NetApp NAS: simple administrative API
• Copy-on-write (instantaneous snapshotting)
• Can serve as input for MR, distcp, backups, etc.
• Initially read-only, some thought about read-write in the future
• In progress now, hoping to merge into trunk by summertime (an API sketch follows below)
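The feature was still unmerged when this deck was given, so the following is only a sketch of the kind of API it eventually exposed (allowSnapshot on DistributedFileSystem, createSnapshot on FileSystem, with snapshots reachable under <dir>/.snapshot/<name>); treat the exact names as an assumption relative to this talk, and the path as hypothetical:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class SnapshotSketch {
    public static void main(String[] args) throws Exception {
      DistributedFileSystem dfs =
          (DistributedFileSystem) FileSystem.get(new Configuration());
      Path dir = new Path("/user/todd/dataset");
      dfs.allowSnapshot(dir);                                // admin step: mark the subtree snapshottable
      Path snap = dfs.createSnapshot(dir, "before-cleanup"); // instantaneous, copy-on-write
      // The snapshot is a read-only view, usable as input for MR, distcp, backups, etc.
      System.out.println("Snapshot created at " + snap);     // .../.snapshot/before-cleanup
    }
  }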
- 47. Hierarchical storage
• Early exploration into SSD/Flash
• Anticipating "hybrid" storage will become common soon
• What performance improvements do we need to take good advantage of it?
• Tiered caching of hot data onto flash?
• Explicit storage "pools" for apps to manage?
• Big-RAM boxes
• 256GB/box not so expensive anymore
• How can we best make use of all this RAM? Caching!
- 48. Storage efficiency
• Transparent re-compression of cold data?
• More efficient file formats
• Columnar storage for Hive, Impala
• Faster to operate on and more compact
• Work on "fat datanodes"
• 36-72TB/node will require some investment in DataNode scaling
• More parallelism, more efficient use of RAM, etc.