What's New and Upcoming in HDFS - the Hadoop Distributed File System
- 1. What's new and upcoming in HDFS
January 30, 2013
Todd Lipcon, Software Engineer
todd@cloudera.com
@tlipcon
- 2. Introductions
• Software engineer on Cloudera's Storage Engineering team
• Committer and PMC Member for Apache Hadoop and Apache HBase
• Projects in 2012
• Responsible for >50% of the code for all phases of HA development
• Also worked on many performance and stability improvements
• This presentation is highly technical – please feel free to grab/email me later if you'd like to clarify anything!
- 3. Outline
• HDFS 2.0 – what's new in 2012?
• HA Phase 1 (Q1 2012)
• HA Phase 2 (Q2-Q4 2012)
• Performance improvements and other new features
• What's coming in 2013?
• HDFS Snapshots
• Better storage density and file formats
• Caching and Hierarchical Storage Management
- 5. HDFS HA Background
• HDFS's strength is its simple and robust design
• Single master NameNode maintains all metadata
• Scales to multi-petabyte clusters easily on modern hardware
• Traditionally, the single master was also a single point of failure
• Generally good availability, but not ops-friendly
• No hot patch ability, no hot reconfiguration
• No hot hardware replacement
• Hadoop is now mission critical: SPOF not OK!
- 6. HDFS HA Development Phase 1
• Completed March 2012 (HDFS-1623)
• Introduced the StandbyNode, a hot backup for the HDFS NameNode
• Relied on shared storage to synchronize namespace state
• (e.g. a NAS filer appliance)
• Allowed operators to manually trigger failover to the Standby
• Sufficient for many HA use cases: avoided planned downtime for hardware and software upgrades, planned machine/OS maintenance, configuration changes, etc.
- 7. HDFS HA Architecture Phase 1
• Parallel block reports sent to Active and Standby NameNodes
• NameNode state shared by locating the edit log on a NAS over NFS
• Active NameNode writes while the Standby Node "tails"
• Client failover done via client configuration
• Each client configured with the address of both NNs: try both to find the active (see the configuration sketch below)
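A minimal sketch of that client-side configuration, shown via the Java Configuration API rather than hdfs-site.xml; the nameservice ID "mycluster" and the nn1/nn2 hostnames are hypothetical, while the property names and the ConfiguredFailoverProxyProvider class are the standard HDFS HA client settings:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;

  public class HaClientConfigSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Logical name for the NameNode pair; clients address the nameservice, not a host
      conf.set("dfs.nameservices", "mycluster");
      conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
      conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
      conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
      // The proxy provider tries each NameNode until it finds the active one
      conf.set("dfs.client.failover.proxy.provider.mycluster",
          "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
      FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
      System.out.println("Connected to " + fs.getUri());
    }
  }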
- 9. Fencing and NFS
• Must avoid split-brain syndrome
• Both nodes think they are active and try to write to the same edit log. Your metadata becomes corrupt and requires manual intervention to restart
• Configure a fencing script (see the sketch below)
• Script must ensure that the prior active has stopped writing
• STONITH: shoot-the-other-node-in-the-head
• Storage fencing: e.g. using the NetApp ONTAP API to restrict filer access
• Fencing script must succeed to have a successful failover
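A sketch of what a fencing configuration looks like, again via the Java Configuration API; the script path and SSH key location are hypothetical, while dfs.ha.fencing.methods and the built-in sshfence/shell(...) fencers are the standard mechanism:

  import org.apache.hadoop.conf.Configuration;

  public class FencingConfigSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Fencers are tried in order until one succeeds; failover is aborted if all fail.
      // sshfence logs into the previous active and kills the NameNode process;
      // shell(...) runs an arbitrary script (e.g. a filer or PDU fencer).
      conf.set("dfs.ha.fencing.methods",
          "sshfence\nshell(/path/to/my-fencing-script.sh)");
      conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
      System.out.println(conf.get("dfs.ha.fencing.methods"));
    }
  }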
- 10. Shortcomings of Phase 1
• Insufficient to protect against unplanned downtime
• Manual failover only: requires an operator to step in quickly after a crash
• Various studies indicated this was the minority of downtime, but still important to address
• Requirement of a NAS device made deployment complex, expensive, and error-prone (we always knew this was just the first phase!)
- 11. HDFS HA Development Phase 2
• Multiple new features for high availability
• Automatic failover, based on Apache ZooKeeper
• Remove dependency on NAS (network-attached storage)
• Address new HA use cases
• Avoid unplanned downtime due to software or hardware faults
• Deploy in filer-less environments
• Completely stand-alone HA with no external hardware or software dependencies
• no Linux-HA, filers, etc.
- 13. Automatic Failover Goals
• Automatically detect failure of the Active NameNode
• Hardware, software, network, etc.
• Do not require operator intervention to initiate failover
• Once failure is detected, the process completes automatically
• Support manually initiated failover as first-class
• Operators can still trigger failover without having to stop the Active
• Do not introduce a new SPOF
• All parts of an auto-failover deployment must themselves be HA
- 14. Automatic Failover Architecture
• Automatic failover requires ZooKeeper
• Not required for manual failover
• ZK makes it easy to:
• Detect failure of the Active NameNode
• Determine which NameNode should become the Active NN
- 15. Automatic Failover Architecture
• New daemon: ZooKeeper Failover Controller (ZKFC)
• In an auto-failover deployment, run two ZKFCs (enabling this is sketched below)
• One per NameNode, on that NameNode machine
• ZKFC has three simple responsibilities:
• Monitors health of the associated NameNode
• Participates in leader election of NameNodes
• Fences the other NameNode if it wins the election
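A sketch of the two settings that enable this, assuming a hypothetical three-node ZooKeeper ensemble; the property names are the standard ones for HDFS automatic failover:

  import org.apache.hadoop.conf.Configuration;

  public class AutoFailoverConfigSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // hdfs-site.xml: let the ZKFCs drive failover instead of an operator
      conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
      // core-site.xml: ZooKeeper ensemble used for failure detection and leader election
      conf.set("ha.zookeeper.quorum",
          "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
      System.out.println(conf.get("ha.zookeeper.quorum"));
    }
  }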
- 18. Shared Storage in HDFS HA
• The Standby NameNode synchronizes the namespace by following the Active NameNode's transaction log
• Each operation (e.g. mkdir(/foo)) is written to the log by the Active
• The StandbyNode periodically reads all new edits and applies them to its own metadata structures
• Reliable shared storage is required for correct operation
• In phase 1, shared storage was synonymous with NFS-mounted NAS
- 19. Shortcomings of the NFS-based approach
• Custom hardware
• Lots of our customers don't have SAN/NAS available in their datacenters
• Costs money, time and expertise
• Extra "stuff" to monitor outside HDFS
• We just moved the SPOF, didn't eliminate it!
• Complicated
• Storage fencing, NFS mount options, multipath networking, etc.
• Organizationally complicated: dependencies on the storage ops team
• NFS issues
• Buggy client implementations, little control over timeout behavior, etc.
- 20. Primary Requirements for Improved Storage
• No special hardware (PDUs, NAS)
• No custom fencing configuration
• Too complicated == too easy to misconfigure
• No SPOFs
• punting to filers isn't a good option
• need something inherently distributed
- 21. Secondary Requirements
• Configurable degree of fault tolerance
• Configure N nodes to tolerate (N-1)/2 failures
• Making N bigger (within reasonable bounds) shouldn't hurt performance. Implies:
• Writes done in parallel, not pipelined
• Writes should not wait on the slowest replica
• Locate replicas on existing hardware investment (e.g. share with JobTracker, NN, SBN)
- 22. Operational Requirements
• Should be operable by existing Hadoop admins. Implies:
• Same metrics system ("hadoop metrics")
• Same configuration system (XML)
• Same logging infrastructure (log4j)
• Same security system (Kerberos-based)
• Allow existing ops to easily deploy and manage the new feature
• Allow existing Hadoop tools to monitor the feature
• (e.g. Cloudera Manager, Ganglia, etc.)
- 23. Our solution: QuorumJournalManager
• QuorumJournalManager (client)
• Plugs into the JournalManager abstraction in the NN (instead of the existing FileJournalManager)
• Provides the edit log storage abstraction
• JournalNode (server)
• Standalone daemon running on an odd number of nodes (see the configuration sketch below)
• Provides actual storage of edit logs on local disks
• Could run inside other daemons in the future
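A sketch of how the NameNode is pointed at a JournalNode quorum instead of an NFS directory; the hostnames and local path are hypothetical, 8485 is the default JournalNode RPC port, and the journal ID after the slash conventionally matches the nameservice name:

  import org.apache.hadoop.conf.Configuration;

  public class QjmConfigSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // NameNode side: write shared edits to a quorum of JournalNodes
      conf.set("dfs.namenode.shared.edits.dir",
          "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
      // JournalNode side: where each JN keeps its copy of the edit log on local disk
      conf.set("dfs.journalnode.edits.dir", "/data/1/hdfs/journal");
      System.out.println(conf.get("dfs.namenode.shared.edits.dir"));
    }
  }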
- 24. Architecture
- 25. Commit protocol
• NameNode accumulates edits locally as they are logged
• On logSync(), sends the accumulated batch to all JNs via Hadoop RPC
• Waits for a success ACK from a majority of nodes (sketched below)
• Majority commit means that a single lagging or crashed replica does not impact NN latency
• Latency @ NN = median(Latency @ JNs)
• Uses the well-known Paxos algorithm to perform recovery of any in-flight edits on leader switchover
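An illustrative sketch of the majority-ACK idea (hypothetical names, not the actual QJM code): the batch is sent to every JournalNode in parallel and logSync() returns as soon as a majority respond, so latency tracks the median JournalNode rather than the slowest one:

  import java.util.List;
  import java.util.concurrent.CountDownLatch;
  import java.util.concurrent.ExecutorService;

  public class QuorumCommitSketch {
    // Send one batch of edits to all JournalNodes and wait for a majority of ACKs.
    static void logSync(List<ExecutorService> journalNodeRpcPools, byte[] editBatch)
        throws InterruptedException {
      int majority = journalNodeRpcPools.size() / 2 + 1;
      CountDownLatch acks = new CountDownLatch(majority);
      for (ExecutorService jn : journalNodeRpcPools) {
        jn.submit(() -> {
          sendEditsOverRpc(editBatch);  // hypothetical RPC call to one JournalNode
          acks.countDown();             // count the successful ACK
        });
      }
      acks.await();  // returns once a majority have ACKed; stragglers finish in the background
    }

    static void sendEditsOverRpc(byte[] batch) { /* placeholder for the real RPC */ }
  }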
- 26. JN Fencing
• How do we prevent split-brain?
• Each instance of QJM is assigned a unique epoch number
• provides a strong ordering between client NNs
• Each IPC contains the client's epoch
• JN remembers on disk the highest epoch it has seen
• Any request from an earlier epoch is rejected. Any from a newer one is recorded on disk (sketched below)
• Distributed Systems folks may recognize this technique from Paxos and other literature
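An illustrative sketch of that epoch check as a JournalNode might apply it (hypothetical names, not the actual JournalNode code); the real implementation persists the highest promised epoch to disk before acknowledging a newer writer:

  public class EpochFencingSketch {
    private long highestPromisedEpoch;  // durably recorded on local disk in the real design

    // Called for every IPC; the request carries the writer's epoch number.
    synchronized void checkEpoch(long requestEpoch) {
      if (requestEpoch < highestPromisedEpoch) {
        // An older, fenced-out writer: reject its request outright
        throw new IllegalStateException("Rejecting request with stale epoch "
            + requestEpoch + " < " + highestPromisedEpoch);
      }
      if (requestEpoch > highestPromisedEpoch) {
        highestPromisedEpoch = requestEpoch;  // remember the new writer before serving it
      }
    }
  }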
- 27. Fencing with epochs
• Fencing is now implicit
• The act of becoming active causes any earlier active NN to be fenced out
• Since a quorum of nodes has accepted the new active, any other IPC by an earlier epoch number can't get a quorum
• Eliminates confusing and error-prone custom fencing configuration
- 28. Other implementation features
• Hadoop Metrics
• lag, percentile latencies, etc. from the perspective of JN, NN
• metrics for queued txns, % of time each JN fell behind, etc., to help suss out a slow JN before it causes problems
• Security
• full Kerberos and SSL support: edits can be optionally encrypted in-flight, and all access is mutually authenticated
- 30. Testing
• Randomized fault test
• Runs all communications in a single thread with deterministic order and fault injections based on a seed
• Caught a number of really subtle bugs along the way
• Run as an MR job: 5000 fault tests in parallel
• Multiple CPU-years of stress testing: found 2 bugs in Jetty!
• Cluster testing: 100-node, MR, HBase, Hive, etc.
• Commit latency in practice: within the same range as local disks (better than one of two local disks, worse than the other one)
- 31. Deployment
• Most customers running 3 JNs (tolerate 1 failure)
• 1 on NN, 1 on SBN, 1 on JobTracker/ResourceManager
• Optionally run 2 more (e.g. on bastion/gateway nodes) to tolerate 2 failures
• No new hardware investment
• Refer to the docs for detailed configuration info
- 32. Status
• Merged into Hadoop development trunk in early October
• Available in CDH4.1, will be in the upcoming Hadoop 2.1
• Deployed at several customer/community sites with good success so far (no lost data)
• In contrast, we've had several issues with misconfigured NFS filers causing downtime
• Highly recommend you use Quorum Journaling instead of NFS!
- 33. Summary of HA Improvements
• Run an active NameNode and a hot Standby NameNode
• Automatically triggers seamless failover using Apache ZooKeeper
• Stores shared metadata on QuorumJournalManager: a fully distributed, redundant, low-latency journaling system
• All improvements available now in HDFS branch-2 and CDH4.1
- 35. Performance Improvements (overview)
• Several improvements made for Impala
• Much faster libhdfs
• APIs for spindle-based scheduling
• Other more general improvements (especially for HBase and Accumulo)
• Ability to read directly from block files in secure environments
• Ability for applications to perform their own checksums and eliminate IOPS
- 36. libhdfs "direct read" support (HDFS-2834)
• This can also benefit apps like HBase, Accumulo, and MR with a bit more work (TBD in 2013)
- 37. Disk locations API (HDFS-3672)
• HDFS has always exposed node locality information
• Map<Block, List<Datanode Addresses>>
• Now it can also expose disk locality information (see the sketch below)
• Map<Replica, List<Spindle Identifiers>>
• Impala uses this API to keep all disks spinning at full throughput
• ~2x improvement on IO-bound workloads on 12-spindle machines
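A sketch of querying that spindle information, assuming the experimental API added by HDFS-3672 (DistributedFileSystem#getFileBlockStorageLocations returning BlockStorageLocations with VolumeIds); the exact class and method names are my reading of that experimental interface and may differ, and it also requires dfs.datanode.hdfs-blocks-metadata.enabled on the DataNodes:

  import java.util.Arrays;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.BlockStorageLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.VolumeId;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class DiskLocationsSketch {
    public static void main(String[] args) throws Exception {
      DistributedFileSystem dfs =
          (DistributedFileSystem) FileSystem.get(new Configuration());
      Path file = new Path(args[0]);
      FileStatus stat = dfs.getFileStatus(file);
      BlockLocation[] blocks = dfs.getFileBlockLocations(stat, 0, stat.getLen());
      // Ask the DataNodes which disk (volume) each replica lives on
      BlockStorageLocation[] volumeLocations =
          dfs.getFileBlockStorageLocations(Arrays.asList(blocks));
      for (BlockStorageLocation loc : volumeLocations) {
        for (VolumeId volume : loc.getVolumeIds()) {
          System.out.println(loc + " -> volume " + volume);
        }
      }
    }
  }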
- 38. Short-circuit reads
• "Short circuit" allows HDFS clients to open HDFS block files directly from the local filesystem
• Avoids context switches and trips back and forth from user space to kernel space memory, the TCP stack, etc.
• Uses 50% less CPU, avoids significant latency when reading data from the Linux buffer cache
• Sequential IO performance: 2x improvement
• Random IO performance: 3.5x improvement
• This has existed for a while in insecure setups only!
• Clients need read access to all block files :(
- 39. Secure short-circuit reads (HDFS-347)
• DataNode continues to arbitrate access to block files
• Opens input streams and passes them to the DFS client after authentication and authorization checks
• Uses a trick involving Unix Domain Sockets (sendmsg with SCM_RIGHTS)
• Now perf-sensitive apps like HBase, Accumulo, and Impala can safely configure this feature in all environments (configuration sketch below)
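A sketch of enabling the secure short-circuit path; the socket path shown is a hypothetical example, while dfs.client.read.shortcircuit and dfs.domain.socket.path are the standard properties (set on both the DataNodes and the clients):

  import org.apache.hadoop.conf.Configuration;

  public class ShortCircuitConfigSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Let clients read block files directly instead of streaming them over TCP
      conf.setBoolean("dfs.client.read.shortcircuit", true);
      // Unix domain socket the DataNode uses to pass open file descriptors to clients
      conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");
      System.out.println(conf.get("dfs.domain.socket.path"));
    }
  }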
- 40. Checksum skipping (HDFS-3429)
• Problem: HDFS stores block data and block checksums in separate files
• A truly random read incurs two seeks instead of one!
• Solution: HBase now stores its own checksums on its own internal 64KB blocks
• But it turns out that prior versions of HDFS still read the checksum file, even if the client flipped verification off (see the sketch below)
• Fixing this yielded a 40% reduction in IOPS and latency for a multi-TB uniform random-read workload!
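A sketch of the client-side switch involved, using the public FileSystem API; an application that does its own checksumming (as HBase does) asks HDFS to skip verification, and with HDFS-3429 that also avoids touching the separate checksum file:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class OwnChecksumReadSketch {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // The application takes responsibility for data integrity itself,
      // so HDFS no longer needs the .meta checksum file on each random read
      fs.setVerifyChecksum(false);
      try (FSDataInputStream in = fs.open(new Path(args[0]))) {
        byte[] buffer = new byte[64 * 1024];
        in.readFully(0, buffer);  // positioned read of one 64KB application-level block
      }
    }
  }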
- 41. Still more to come?
• Not a ton left on the read path
• Write path still has some low-hanging fruit – hang tight for next year
• Reality check (multi-threaded random-read)
• Hadoop 1.0: 264MB/sec
• Hadoop 2.x: 1393MB/sec
• We've come a long way (5x) in a few years!
- 43. On-the-wire Encryption
• Strong encryption now supported for all traffic on the wire (configuration sketch below)
• both data and RPC
• Configurable cipher (e.g. RC5, DES, 3DES)
• Developed specifically based on requirements from the IC
• Reviewed by some experts here today (thanks!)
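A sketch of the switches involved; the property names are the standard ones (RPC protection in core-site.xml, data transfer encryption in hdfs-site.xml) and the cipher value is just an example:

  import org.apache.hadoop.conf.Configuration;

  public class WireEncryptionConfigSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Encrypt Hadoop RPC (SASL quality of protection "privacy")
      conf.set("hadoop.rpc.protection", "privacy");
      // Encrypt the DataNode block-transfer protocol as well
      conf.setBoolean("dfs.encrypt.data.transfer", true);
      // Optionally choose the cipher used for data transfer
      conf.set("dfs.encrypt.data.transfer.algorithm", "3des");
      System.out.println(conf.get("hadoop.rpc.protection"));
    }
  }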
- 44. Rolling Upgrades and Wire Compatibility
• RPC and Data Transfer now use Protocol Buffers
• Easy for developers to add new features without breaking compatibility
• Allows zero-downtime upgrades between minor releases
• Planning to lock down client-server compatibility even for more major releases in 2013
- 46. HDFS Snapshots
• Full support for efficient subtree snapshots
• Point-in-time "copy" of a part of the filesystem
• Like a NetApp NAS: simple administrative API
• Copy-on-write (instantaneous snapshotting)
• Can serve as input for MR, distcp, backups, etc.
• Initially read-only, some thought about read-write in the future
• In progress now, hoping to merge into trunk by summertime (an API sketch follows below)
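The feature was still unmerged when this deck was given, so the following is only a sketch of the kind of API it eventually exposed (allowSnapshot on DistributedFileSystem, createSnapshot on FileSystem, with snapshots reachable under <dir>/.snapshot/<name>); treat the exact names as an assumption relative to this talk, and the path as hypothetical:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class SnapshotSketch {
    public static void main(String[] args) throws Exception {
      DistributedFileSystem dfs =
          (DistributedFileSystem) FileSystem.get(new Configuration());
      Path dir = new Path("/user/todd/dataset");
      dfs.allowSnapshot(dir);                                // admin step: mark the subtree snapshottable
      Path snap = dfs.createSnapshot(dir, "before-cleanup"); // instantaneous, copy-on-write
      // The snapshot is a read-only view, usable as input for MR, distcp, backups, etc.
      System.out.println("Snapshot created at " + snap);     // .../.snapshot/before-cleanup
    }
  }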
- 47. Hierarchical storage
• Early exploration into SSD/Flash
• Anticipating "hybrid" storage will become common soon
• What performance improvements do we need to take good advantage of it?
• Tiered caching of hot data onto flash?
• Explicit storage "pools" for apps to manage?
• Big-RAM boxes
• 256GB/box not so expensive anymore
• How can we best make use of all this RAM? Caching!
- 48. Storage efficiency
• Transparent re-compression of cold data?
• More efficient file formats
• Columnar storage for Hive, Impala
• Faster to operate on and more compact
• Work on "fat datanodes"
• 36-72TB/node will require some investment in DataNode scaling
• More parallelism, more efficient use of RAM, etc.