Apache HBase is a rapidly evolving random-access distributed data store built on top of Apache Hadoop's HDFS and Apache ZooKeeper. Drawing from real-world support experience, this talk provides administrators with insight into improving HBase's availability and recovering from situations where HBase is not available. We share tips on the common root causes of unavailability, explain how to diagnose them, and prescribe measures for ensuring maximum availability of an HBase cluster. We discuss new features that improve recovery time, such as distributed log splitting, as well as supportability improvements. We also describe utilities, including new failure-recovery tools that we have developed and contributed, that can be used to diagnose and repair rare corruption problems on live HBase systems.
1. Improving HBase Availability and Repair
Jeff Bean, Jonathan Hsieh
{jwfbean,jon}@cloudera.com
6/13/12
2. Who Are We?
• Jeff Bean
  • Designated Support Engineer, Cloudera
  • Education Program Lead, Cloudera
• Jonathan Hsieh
  • Software Engineer, Cloudera
  • Apache HBase Committer and PMC member
Hadoop Summit 2012. 6/13/12. Copyright 2012 Cloudera Inc, All Rights Reserved
3. What is Apache HBase?
Apache HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access.
4. Fault Tolerance vs. High Availability
• Fault tolerant: the ability to recover service if a component fails, without losing data.
• Highly available: the ability to quickly recover service if a component fails, without losing data.
• Goal: minimize downtime!
5. HBase Architecture
• HBase is designed to be fault tolerant and highly available
  • It depends on other systems (App, MR, ZK, HDFS) to be as well.
• Replication for fault tolerance
  • Serve regions from any RegionServer
  • Failover HMasters
  • ZK quorums
  • HDFS block replication on DataNodes
• But replication doesn't guarantee high availability
  • There can still be software or human faults
7. Causes of Unexpected Maintenance Incidents
Unplanned maintenance: root causes from Cloudera Support
• Misconfiguration
• Metadata corruptions
• Network / HW problems
• SW problems
• Long recovery time
  • Automated and manual
Root-cause breakdown:
• Misconfig: 44%
• Repair Needed (HBase, ZK, MR, HDFS): 28%
• Fix HW/NW: 16%
• Patch Required: 12%
Source: Cloudera's production HBase support tickets (CDH3's HBase 0.90.x, Hadoop 0.20.x/1.0.x)
8. Outline
• Where we were
  • HBase 0.90.x + Hadoop 0.20.x/1.0.x
  • Case studies
• Where we are today
  • HBase 0.92.x/0.94.x + Hadoop 2.0.x
  • Feature summary
• Where we are going
  • HBase 0.96.x + Hadoop 2.x
  • Feature preview
9. "[T]here are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – there are things we do not know we don't know."
—United States Secretary of Defense Donald Rumsfeld
WHERE WE WERE: CASE STUDIES
10. Best Practices to Avoid Hazards
Best practices can prevent HBase misconfigurations, the largest slice of unplanned maintenance.
(Root causes, per Cloudera Support: Misconfig 44%, Repair Needed 28%, Fix HW/NW 16%, Patch Required 12%. Source: Cloudera's production HBase support tickets, CDH3's HBase 0.90.x, Hadoop 0.20.x/1.0.x)
11. Case #1: Memory Over-subscription Hazard
Misconfiguration:
• Too many MR slots
• MR slots too large
• "Arbitrary" processes
Bad outcome (Node A swaps under load; Node B can't connect to Node A):
• MapReduce tasks fail
• HDFS datanode operations time out
• Processes pause or become unresponsive
• HBase client operations fail
Masters take action:
• JobTracker blacklists the TaskTracker
• Jobs fail or run slow
• NameNode re-replicates blocks from Node A
12. Case #2, #3: Hazards of Abusing HDFS and ZK
Case #2: Millions of HDFS files
• Bad practice: 500,000 blocks per datanode
• SW bug: heartbeat thread blocks on IO, fails to create new blocks
• Bad outcome: RegionServers cannot access HDFS; HBase goes down
Case #3: Millions of ZK nodes
• Misconfiguration: millions of ZK znodes; 400MB snapshot
• SW bug: ZK fails to write snapshots
• Bad outcome: HBase goes down
• SW bug, worse outcome: HBase fails to restart
13. Case #4: Splitting Corruption from HW Failure
• HW failure: network failure (takes out the NN)
• SW bug: region split recovery leaves an incomplete split
• Result: HBase has region inconsistencies (overlaps / holes)
• Repair is manual, slow, and requires an expert: multiple 6-hour manual repair sessions
14. Case #5: Slow Recovery from HW Failure
• Human error: network HW failure
• SW error: on restart, the RS loses its hlog; ROOT and .META. assignment fails
• Manual repairs: HDFS and WALs
• 9-hour hlog splitting recovery: correct, but slow!
15. Initial Lessons
• Use best practices to avoid problems
  • Be conservative first
  • Avoid unstable features
• What can we do?
  • Fix the bugs
  • Recover from problems faster
  • Make people smarter to avoid hazards and misconfigurations
  • Make software smarter to prevent hazards and misconfigurations
16. "In war, then, let your great object be victory, not lengthy campaigns." -- Sun Tzu
WHERE WE ARE TODAY: HBASE 0.92.X + HADOOP 2.0.X
17. Goal: Reduce Unexpected Downtime by Recovering Faster
• Removing the SPOFs
  • HA HDFS
• Faster recovery
  • Improved hbck
  • Distributed log splitting
18. Problem: HDFS NN Goes Down Under HBase
• HBase depends on HDFS.
  • If HDFS is down, HBase goes down.
• Ramifications:
  • Forces the recovery mechanism
  • Caused some data corruptions
• Ideally we avoid having to do recovery at all.
19. HBase-HDFS HA Nodes
• NameNode (active metadata server) + NameNode (standby): active-standby hot failover
• HMaster (region metadata) + HMaster (hot standby)
• ZooKeeper quorum
• HDFS DataNodes and HBase RegionServers
20. HBase-HDFS HA Nodes: Transparent to HBase
• HMaster (region metadata) + HMaster (hot standby)
• NameNode (active): the HA pair appears as a single NameNode to HBase
• ZooKeeper quorum
• HDFS DataNodes and HBase RegionServers
21. HBase-HDFS HA Nodes: No More SPOF
• HMaster (active) + NameNode (active)
• ZooKeeper quorum
• HDFS DataNodes and HBase RegionServers
22. Recovery Operations
• If a network switch fails or if there is a power outage, HBase, ZK, and HA HDFS will fail
• We will always still rely on recovery mechanisms, and need to be able to recover quickly
  • Metadata invariants to fix metadata corruptions
  • Data consistency to restore ACID guarantees
23. HBase Metadata Corruptions
• Internal HBase metadata corruptions:
  • Prevent HBase from starting
  • Cause some regions to be unavailable
• Repairs are intricate and can cause extended periods of downtime.
(Root causes, per Cloudera Support: Misconfig 44%, Repair Needed 28%, Fix HW/NW 16%, Patch Required 12%)
24. HBase Metadata Invariants
Table integrity:
• Every key shall get assigned to a single region.
• Example region chain: ['', A), [A, B), [B, C), [C, D), [D, E), [E, F), [F, G), [G, '')
Region consistency:
• Metadata about regions should agree in HDFS, META, and region server assignment.
• A good region has a regioninfo entry in META, an assignment to a RegionServer, and a .regioninfo file in HDFS.
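The table-integrity invariant can be sketched as a check over the sorted region chain: adjacent regions must meet exactly, with no holes and no overlaps. A minimal illustration of the idea (this mirrors the invariant, not hbck's actual implementation):

```python
# Table-integrity sketch: every key must belong to exactly one region.
# Regions are (start_key, end_key) pairs; '' marks the open ends of the
# table. Illustrative only -- not hbck's real data structures.

def check_region_chain(regions):
    """Return a list of problems found: holes, overlaps, bad endpoints."""
    problems = []
    regions = sorted(regions)                 # order by start key
    if not regions:
        return ["no regions"]
    if regions[0][0] != '':
        problems.append("first region does not start at ''")
    if regions[-1][1] != '':
        problems.append("last region does not end at ''")
    for (s1, e1), (s2, e2) in zip(regions, regions[1:]):
        if e1 < s2:
            problems.append(f"hole between {e1!r} and {s2!r}")
        elif e1 > s2:
            problems.append(f"overlap: [{s1!r},{e1!r}) and [{s2!r},{e2!r})")
    return problems

good = [('', 'A'), ('A', 'B'), ('B', 'C'), ('C', '')]
bad  = [('', 'A'), ('B', 'C'), ('C', '')]     # [A,B) is covered by nothing
print(check_region_chain(good))               # []
print(check_region_chain(bad))
```

On the `good` chain the check returns no problems; on `bad` it reports the hole between 'A' and 'B', i.e. keys that no region would serve.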
25. Detecting and Repairing Corruption with hbck
• HBase 0.90 hbck:
  • Checks an HBase instance's internal invariants.
• HBase hbck today:
  • Checks and can fix problems in an HBase instance's internal invariants
  • In 0.90.7, 0.92.2, 0.94.0 (CDH3u4, CDH4)
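One of the repairs this enables is plugging a hole in the region chain by fabricating an empty region that spans the gap. A toy version of that idea, purely for intuition (real hbck must also update META, HDFS, and assignments):

```python
# Toy hole repair: when a key range is covered by no region, synthesize
# an empty region spanning the gap so the chain is complete again.
# Illustrative only -- not hbck's actual repair code.

def plug_holes(regions):
    """Return the region chain with any holes filled by empty regions."""
    regions = sorted(regions)
    fixed = []
    for (s1, e1), (s2, e2) in zip(regions, regions[1:]):
        fixed.append((s1, e1))
        if e1 < s2:                      # hole: no region covers [e1, s2)
            fixed.append((e1, s2))       # synthesize an empty region
    fixed.append(regions[-1])
    return fixed

broken = [('', 'A'), ('B', 'C'), ('C', '')]      # [A,B) is missing
print(plug_holes(broken))
```

After the repair, every key range is covered again and the table-integrity invariant holds.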
26. Case #4 Redux: Splitting Corruption
• HW failure: network failure (takes out the NN)
• SW bug: region split recovery leaves an incomplete split
• HBase has region inconsistencies (overlaps / holes)
• Repair is manual, slow, and requires an expert: multiple 6-hour manual repair sessions
27. Case #4 Redux: Splitting Corruption
• HW failure: network failure (takes out the NN)
• SW bug: region split recovery leaves an incomplete split
• HBase has region inconsistencies (overlaps / holes)
• Automated repair tool (minutes): fixes are quicker, and the operator can run them
28. Case #4 Redux: Splitting Corruption
• HW failure: network failure (takes out the NN)
• SW bug fixed: split recovery no longer leaves incomplete splits
• Only minor inconsistencies remain (bad assignments)
• Automated repair tool (seconds)
29. Data Consistency
• When a region server goes down, it tries to flush data in memory to HDFS.
  • If it cannot write to HDFS, it relies on the WAL/HLog.
• Recovery via the HLog is vital to prevent data loss
  • Understand the write path.
  • Recovery: HLog splitting.
  • Faster recovery: distributed HLog splitting.
30. Write Path (Put / Delete / Increment)
• The HBase client sends a Put to the Region Server.
• The Region Server appends the edit to its HLog, then applies it to the target HRegion's MemStore (backed by HStores).
31. Write Path (Put / Delete / Increment)
• Note: both regions on a Region Server write to the same HLog.
• Each Put is appended to the shared HLog, then applied to its own HRegion's MemStore.
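The write path above can be sketched in a few lines: one shared log per server, written before any in-memory state changes. The class and field names here are illustrative, not HBase's actual Java classes:

```python
# Sketch of the HBase write path: a region server appends every edit to
# one shared write-ahead log (the HLog) *before* applying it to the
# region's in-memory MemStore. Names are illustrative, not HBase's.

class RegionServerSketch:
    def __init__(self, region_names):
        self.hlog = []                                   # one WAL, shared
        self.memstores = {r: {} for r in region_names}   # per-region memory

    def put(self, region, key, value):
        # 1. Durability first: append the edit to the shared WAL.
        self.hlog.append((region, key, value))
        # 2. Then apply it to the region's MemStore (later flushed to
        #    on-disk HStores).
        self.memstores[region][key] = value

rs = RegionServerSketch(['regionA', 'regionB'])
rs.put('regionA', 'row1', 'x')
rs.put('regionB', 'row9', 'y')
print(len(rs.hlog))          # edits from both regions share one log
```

Because edits from all regions are interleaved in one log, replaying a dead server's log requires splitting it back out per region first, which is exactly the log-splitting step the next slides walk through.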
32. Log Splitting
The HMaster oversees three RegionServers; each RegionServer has an HLog (HLog1, HLog2, HLog3, …) and serves HRegions with in-memory MemStores.
33. Log Splitting
The RegionServers fail; their HLogs must be split per region so the edits can be replayed.
39. Log Splitting
HMaster: "Whew. I did a lot of splitting work. That took 9 hours!"
40. Log Splitting
HMaster: "RegionServers, here are your region assignments." (RegionServer4, RegionServer5, RegionServer6)
41. Log Splitting
Victory! The regions are reassigned to RegionServer4–6 and their MemStores are restored.
42. Can We Recover More Quickly?
• In the case study, this is all done serially by the master
  • The master took 9 hours to recover.
  • The 100 region server nodes were idle.
• Let's use the idle machines to do splitting in parallel!
• Distributed log splitting (HBASE-1364)
  • Introduced in 0.92.0 by Prakash Khemani (Facebook)
  • Included in CDH4 (0.92.1)
  • Backported to CDH3u3 (off by default)
43. Distributed Log Splitting
HMaster: "I'm the boss." (RegionServers holding HLog1, HLog2, HLog3, …)
44. Distributed Log Splitting
HMaster: "There is a lot of splitting work here, let's split it up."
45. Distributed Log Splitting
HMaster: "You guys do the work for me." (HLog1, HLog2, HLog3 are handed to RegionServer4, RegionServer5, RegionServer6)
46. Distributed Log Splitting
The RegionServers split the HLogs in parallel.
47. Distributed Log Splitting
HMaster: "Great, that took 5.4 minutes."
48. Distributed Log Splitting
HMaster: "Good job, here are your region assignments."
49. Distributed Log Splitting
Like a boss. The regions come back online on RegionServer4–6 with their MemStores restored.
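The speedup in the sequence above comes from a simple fact: the recovery work decomposes into one independent task per HLog file, so surviving servers can take tasks in parallel instead of the master working through them serially. A back-of-the-envelope sketch (the per-log timings are made up, not measurements):

```python
# Why distributed log splitting wins: splitting is one task per HLog
# file, so N surviving region servers can work N logs at once instead
# of the master doing them one after another. Timings are made up.

def serial_split_time(log_times):
    """Master splits one log at a time: total is the sum."""
    return sum(log_times)

def distributed_split_time(log_times, n_workers):
    """Greedy assignment: hand the next log to the least-loaded worker;
    total is the busiest worker's finish time."""
    workers = [0.0] * n_workers
    for t in sorted(log_times, reverse=True):
        workers[workers.index(min(workers))] += t
    return max(workers)

logs = [60.0] * 9                          # nine logs, one minute each
print(serial_split_time(logs))             # 540.0
print(distributed_split_time(logs, 3))     # 180.0
```

With nine equal logs and three workers the wall-clock time drops 3x; with the ~100 otherwise-idle region servers from the case study, the same decomposition is what turns a 9-hour serial recovery into minutes.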
50. Case #5 Redux: Network Failure and Slow Recovery
• Human error: network HW failure
• SW error: on restart, the RS loses its hlog; ROOT and .META. assignment fails
• Manual repairs: HDFS and WALs
• 9-hour hlog splitting recovery: correct, but slow!
51. Case #5 Redux: Network Failure and Slow Recovery
• Human error: network HW failure
• Fixed! On restart, ROOT and .META. assignment no longer fails
• Automatic repairs: HDFS and WALs
• 5.4-minute splitting recovery: correct and faster!
52. WHERE WE ARE GOING: HBASE 0.96 + HADOOP 2.X
53. Themes
• Minimizing planned downtime (HBase downtime distribution: planned vs. unplanned)
  • Changing configurations
  • Online schema change (experimental in 0.92, 0.94)
  • Rolling restarts
  • Wire compatibility
54. Table Unavailable When Changing Schema
• Changing table schema requires disabling the table
  • disable table, alter table schema, enable table
  • Schema includes compression, CFs, caching, TTL, versions.
• Goal: quickly change table and column configuration settings without having to disable HBase tables.
• Feature: Online Schema Change (HBASE-1730)
  • Included in, but considered experimental in, HBase 0.92/0.94.
  • Contributed by Facebook
55. Changing Server Configs and Software Updates
• A rolling restart is an operation for upgrading an HBase cluster to a compatible version while keeping HBase available and serving data.
  • Handles server config changes.
  • Handles code changes like hotfixes or compatible upgrades.
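The rolling-restart sequence shown on the following slides can be summarized as a plan: restart one daemon at a time so that regions can move to the surviving servers and a standby master can take over. A minimal sketch of that ordering (node names and the exact order are illustrative):

```python
# Sketch of a rolling restart: one daemon restarted at a time, so the
# cluster keeps serving throughout. Node names are illustrative.

def rolling_restart_plan(region_servers, masters):
    """Return the ordered restart steps for a rolling restart."""
    plan = []
    for rs in region_servers:          # each region server in turn;
        plan.append(('restart', rs))   # its regions move to the others
    for hm in masters:                 # then each master in turn; the
        plan.append(('restart', hm))   # standby takes over meanwhile
    return plan

plan = rolling_restart_plan(['RS1', 'RS2', 'RS3', 'RS4'], ['HM1', 'HM2'])
for step in plan:
    print(step)
```

At every step only one daemon is down, which is why the user, admin, and internal operations in the slides keep flowing, and also why every server must stay wire compatible with the mixed versions running mid-restart.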
56.–69. Rolling Restart (animation)
The cluster runs ZK, a client shell, HM1, HM2, and RS1–RS4, with admin, user, and internal operations continuing throughout. One daemon is restarted at a time:
• Slides 57–64: each RegionServer is restarted in turn (RS1, then RS2, RS3, RS4); the others keep serving its regions.
• Slides 65–68: each HMaster is restarted in turn (HM1, then HM2); the standby takes over in the meantime.
• Slide 69: the whole cluster is back up; availability was maintained throughout.
70. Rolling Restart Limitations
• There are limitations on rolling restarts
  • All servers and clients must be wire compatible
  • All must be able to read old data in the FS and ZK.
• Ramifications:
  • Only minor version upgrades possible
  • New features that change RPCs require custom compatibility shims.
  • Data format changes not possible across minor versions.
(Root causes, per Cloudera Support: Misconfig 44%, Repair Needed 28%, Fix HW/NW 16%, Patch Required 12%. Source: Cloudera's production HBase support tickets, CDH3's HBase 0.90.x, Hadoop 0.20.x/1.0.x)
71. HBase Compatibility and Extensibility
• Coming in HBase 0.96
  • HBASE-5305 and friends
• Goals:
  • Allow API changes and persistent data structure changes while guaranteeing compatibility between different minor versions (0.96.0 -> 0.96.1)
  • HBase client-server compatibility between major versions (0.96.x -> 0.98.x)
72. HDFS Wire Compatibility
• Here in HDFS 2.0.x
  • HADOOP-7347 and friends
• Goals:
  • Allow API changes while guaranteeing wire compatibility between different minor versions
  • HDFS client-server compatibility between major versions
74. CONCLUSIONS
75. Improving How We Handle the Causes of Downtime
HBase downtime distribution: planned downtime is addressed by wire compatibility; unplanned downtime by best practices, hbck, and distributed log splitting.
Unplanned maintenance root causes (per Cloudera Support): Misconfig 44% (best practices), Repair Needed 28% (hbck), Fix HW/NW 16% (hbck and distributed log splitting), Patch Required 12% (wire compat).
Source: Cloudera's production HBase support tickets (CDH3's HBase 0.90.x, Hadoop 0.20.x/1.0.x)
76. QUESTIONS?
jon@cloudera.com
Twitter: @jmhsieh
We're hiring!