1. HBase Low Latency
Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
HBaseCon, May 5, 2014
2. Agenda
• Latency: what it is, how to measure it
• Write path
• Read path
• Next steps
3. What's low latency
Latency is about percentiles
• Average != 50th percentile
• There are often orders of magnitude between the « average » and the « 95th percentile »
• Post-99% = the « magical 1% ». Work in progress here.
• Meaning ranges from microseconds (high-frequency trading) to seconds (interactive queries)
• In this talk: milliseconds
4. Measure latency
bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
• More options related to HBase: autoflush, replicas, …
• Latency measured in microseconds
• Easier for internal analysis
YCSB - Yahoo! Cloud Serving Benchmark
• Useful for comparison between databases
• Set of workloads already defined
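As an illustration of the first tool, a write-latency run might look like the line below. The flags are quoted from memory of the tool's usage (running the class with no arguments prints the exact option list), so treat them as an assumption rather than a reference:

  bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --rows=100000 randomWrite 10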
5. Write path
• Two parts
  • Single put (WAL)
    • The client just sends the put
  • Multiple puts from the client (new behavior since 0.96)
    • The client is much smarter
• Four stages to look at for latency
  • Start (establish TCP connections, etc.)
  • Steady: when expected conditions are met
  • Machine failure: expected as well
  • Overloaded system
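A minimal single-put sketch against the 0.96-era client API; the table name "usertable" and the column names are assumptions, not part of the talk:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SinglePutExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml from the classpath
      HTable table = new HTable(conf, "usertable");        // table name is an assumption
      Put put = new Put(Bytes.toBytes("row-1"));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
      table.put(put);   // one blocking RPC: WAL append + sync + memstore write on the server
      table.close();
    }
  }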
6. Single put: communication & scheduling
• Client: TCP connection to the server
  • Shared: multiple threads on the same client use the same TCP connection
  • Pooling is possible and does improve performance in some circumstances
    • hbase.client.ipc.pool.size
• Server: multiple calls from multiple threads on multiple machines
  • Can become thousands of simultaneous queries
  • Scheduling is required
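A sketch of turning on client-side connection pooling (imports as in the earlier single-put example; the pool size of 5 is an illustrative value, not a recommendation):

  Configuration conf = HBaseConfiguration.create();
  // Default: one shared TCP connection per RegionServer; a small pool can help
  // heavily multi-threaded clients.
  conf.setInt("hbase.client.ipc.pool.size", 5);
  HTable table = new HTable(conf, "usertable");   // table name is an assumption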
7. Single put: real work
• The server must
  • Write into the WAL queue
  • Sync the WAL queue (HDFS flush)
  • Write into the memstore
• The WAL queue is shared between all the regions/handlers
  • The sync is avoided if another handler already did the work
  • You may flush more than expected
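The HDFS flush in the middle step is the part a client can relax through the per-mutation durability setting. A hedged sketch of that knob, using a table opened as in the earlier example (relaxing durability trades away the synced-WAL guarantee, so this is an illustration rather than advice):

  // Requires org.apache.hadoop.hbase.client.Durability in addition to the earlier imports.
  Put put = new Put(Bytes.toBytes("row-2"));
  put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
  // SYNC_WAL waits for the HDFS flush described above; ASYNC_WAL appends to the
  // WAL without waiting for the sync, lowering latency at the cost of durability.
  put.setDurability(Durability.ASYNC_WAL);
  table.put(put);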
8. Simple put: a small run
Percentile   Time in ms
Mean         1.21
50%          0.95
95%          1.50
99%          2.12
9. Latency sources
• Candidate one: the network
  • 0.5 ms within a datacenter
  • Much less between nodes in the same rack
Percentile   Time in ms
Mean         0.13
50%          0.12
95%          0.15
99%          0.47
10. Latency sources
• Candidate two: the HDFS flush
  • We can still do better: HADOOP-7714 & sons.
Percentile   Time in ms
Mean         0.33
50%          0.26
95%          0.59
99%          1.24
11. Latency sources
• Millisecond world: everything can go wrong
  • JVM
  • Network
  • OS scheduler
  • File system
• All of this goes into the post-99% percentile
• Requires monitoring
• Usually, using the latest version helps.
12. Latency sources
• Split (and presplits)
  • Autosharding is great!
  • Puts have to wait
  • Impact: seconds
• Balance
  • Regions move
  • Triggers a retry for the client
  • hbase.client.pause = 100ms since HBase 0.96
• Garbage collection
  • Impact: 10's of ms, even with a good config
  • Covered with the read path in this talk
13. From steady to loaded and overloaded
• The number of concurrent tasks is a function of
  • Number of cores
  • Number of disks
  • Number of remote machines used
  • Difficult to estimate
• Queues are doomed to happen
  • hbase.regionserver.handler.count
• So, for low latency
  • Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
  • RPC priorities: work in progress (HBASE-11048)
14. From loaded to overloaded
• MemStore takes too much room: flush, then block quite quickly
  • hbase.regionserver.global.memstore.size.lower.limit
  • hbase.regionserver.global.memstore.size
  • hbase.hregion.memstore.block.multiplier
• Too many HFiles: block until compactions keep up
  • hbase.hstore.blockingStoreFiles
• Too many WAL files: flush and block
  • hbase.regionserver.maxlogs
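A hedged hbase-site.xml sketch for three of the knobs above; the values are illustrative assumptions, not recommendations:

  <property>
    <name>hbase.regionserver.global.memstore.size</name>
    <value>0.4</value> <!-- fraction of the heap all memstores may use before updates are blocked -->
  </property>
  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>10</value> <!-- block updates once a store has this many HFiles, until compactions catch up -->
  </property>
  <property>
    <name>hbase.regionserver.maxlogs</name>
    <value>32</value> <!-- force flushes once this many WAL files accumulate -->
  </property>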
15. Machine failure
• Failure
  • Detect
  • Reallocate
  • Replay the WAL
• Replaying the WAL is NOT required for puts
  • hbase.master.distributed.log.replay
  • (default true in 1.0)
• Failure = detect + reallocate + retry
  • That's in the range of ~1s for simple failures
  • Silent failures put you in the 10s range if the hardware does not help
    • zookeeper.session.timeout
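A hedged hbase-site.xml sketch of the two settings named above (the timeout value is illustrative, not a recommendation):

  <property>
    <name>zookeeper.session.timeout</name>
    <value>30000</value> <!-- ms before a silent RegionServer death is detected -->
  </property>
  <property>
    <name>hbase.master.distributed.log.replay</name>
    <value>true</value> <!-- regions reopen for writes before the WAL replay finishes -->
  </property>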
16. Single puts
• Millisecond range
• Spikes do happen in steady mode
  • 100ms
  • Causes: GC, load, splits
17. Streaming puts
HTable#setAutoFlushTo(false)
HTable#put
HTable#flushCommits
• Like simple puts, but
  • Puts are grouped and sent in the background
  • Load is taken into account
  • Does not block
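A minimal streaming-put sketch using that API (imports as in the earlier single-put example; table and column names are assumptions):

  HTable table = new HTable(conf, "usertable");
  table.setAutoFlushTo(false);                  // buffer puts client-side instead of one RPC per put
  for (int i = 0; i < 100000; i++) {
    Put put = new Put(Bytes.toBytes("row-" + i));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
    table.put(put);                             // returns quickly; groups are sent in the background
  }
  table.flushCommits();                         // push whatever is still buffered
  table.close();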
18. Multiple puts
hbase.client.max.total.tasks (default 100)
hbase.client.max.perserver.tasks (default 5)
hbase.client.max.perregion.tasks (default 1)
• Decouples the client from a latency spike of a region server
• Increases the throughput by 50% compared to the old multiput
• Makes splits and GC more transparent
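These are client-side settings; a sketch of where they are set (the numbers are the defaults quoted above, shown only to illustrate the knobs):

  Configuration conf = HBaseConfiguration.create();
  // Caps on in-flight background tasks for the buffered/multi-put client.
  conf.setInt("hbase.client.max.total.tasks", 100);
  conf.setInt("hbase.client.max.perserver.tasks", 5);
  conf.setInt("hbase.client.max.perregion.tasks", 1);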
19. Conclusion on the write path
• Single puts can be very fast
  • It's not a « hard real time » system: there are spikes
• Most latency spikes can be hidden when streaming puts
• Failures are NOT that difficult for the write path
  • No WAL to replay
21. Read path
• Get/short scan are assumed for low-latency operations
• Again, two APIs
  • Single get: HTable#get(Get)
  • Multi-get: HTable#get(List<Get>)
• Four stages, same as the write path
  • Start (TCP connection, …)
  • Steady: when expected conditions are met
  • Machine failure: expected as well
  • Overloaded system: you may need to add machines or tune your workload
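A short sketch of the two read APIs (imports as in the earlier sketches, plus java.util.ArrayList, java.util.List, org.apache.hadoop.hbase.client.Get and org.apache.hadoop.hbase.client.Result; names are assumptions):

  HTable table = new HTable(conf, "usertable");
  // Single get: one RPC to the region that owns the row
  Result row = table.get(new Get(Bytes.toBytes("row-1")));
  byte[] value = row.getValue(Bytes.toBytes("f"), Bytes.toBytes("q"));
  // Multi-get: the client groups the Gets by RegionServer
  List<Get> gets = new ArrayList<Get>();
  gets.add(new Get(Bytes.toBytes("row-1")));
  gets.add(new Get(Bytes.toBytes("row-2")));
  Result[] rows = table.get(gets);
  table.close();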
22. Multi get / Client
Group Gets by RegionServer
Execute them one by one
25. Access latency magnitudes
Storage hierarchy: a different view (Dean, 2009)
• Memory is 100,000x faster than disk!
• Disk seek = 10ms
26. Known unknowns
• For each candidate HFile
  • Exclude by file metadata
    • Timestamp
    • Rowkey range
  • Exclude by bloom filter
StoreFileScanner#shouldUseScanner()
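Bloom filters are a per-family setting; a hedged sketch of enabling them at table-creation time (table and family names are assumptions; imports include org.apache.hadoop.hbase.HTableDescriptor, HColumnDescriptor, TableName, org.apache.hadoop.hbase.client.HBaseAdmin and org.apache.hadoop.hbase.regionserver.BloomType):

  HBaseAdmin admin = new HBaseAdmin(conf);
  HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("usertable"));
  HColumnDescriptor family = new HColumnDescriptor("f");
  family.setBloomFilterType(BloomType.ROW);   // row-level blooms; ROWCOL is the finer-grained alternative
  desc.addFamily(family);
  admin.createTable(desc);
  admin.close();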
27. Unknown knowns
• Merge-sort results polled from the Stores
  • Seek each scanner to a reference KeyValue
• Retrieve candidate data from disk
  • Multiple HFiles => multiple seeks
    • hbase.storescanner.parallel.seek.enable=true
  • Short-circuit reads
    • dfs.client.read.shortcircuit=true
  • Block locality
    • Happy clusters compact!
HFileBlock#readBlockData()
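A hedged configuration sketch for the two seek-related settings above; short-circuit reads also need a domain socket path on the DataNode and RegionServer side (the path shown is only an example):

  <property>
    <name>hbase.storescanner.parallel.seek.enable</name>
    <value>true</value> <!-- seek the scanners of a multi-HFile store in parallel -->
  </property>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value> <!-- read local blocks directly instead of going through the DataNode -->
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value> <!-- required for short-circuit reads -->
  </property>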
29. BlockCache Showdown
• LruBlockCache
  • Default, on-heap
  • Quite good most of the time
  • Evictions impact GC
• BucketCache
  • Off-heap alternative
  • Serialization overhead
  • Large memory configurations
http://www.n10k.com/blog/blockcache-showdown/
The L2 off-heap BucketCache makes a strong showing.
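A hedged hbase-site.xml sketch of an off-heap BucketCache; the size is an illustrative assumption and has to fit under the JVM's direct-memory limit (-XX:MaxDirectMemorySize):

  <property>
    <name>hbase.bucketcache.ioengine</name>
    <value>offheap</value> <!-- keep the L2 cache off the Java heap -->
  </property>
  <property>
    <name>hbase.bucketcache.size</name>
    <value>8192</value> <!-- megabytes of off-heap cache -->
  </property>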
30. Latency enemies: Garbage Collection
• Use heap. Not too much. With CMS.
• Max heap
  • 30GB (compressed pointers)
  • 8-16GB if you care about 9's
• Healthy cluster load
  • regular, reliable collections
  • 25-100ms pause on a regular interval
• An overloaded RegionServer suffers GC overmuch
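A hedged hbase-env.sh sketch in that spirit; the heap size and CMS threshold are illustrative assumptions, not tuning advice:

  export HBASE_HEAPSIZE=8192   # MB; staying well under 30GB keeps compressed pointers
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
    -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"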
31. Off-heap to the rescue?
• BucketCache (0.96, HBASE-7404)
• Network interfaces (HBASE-9535)
• MemStore et al. (HBASE-10191)
32. Latency enemies: Compactions
• Fewer HFiles => fewer seeks
• Evict data blocks!
• Evict index blocks!!
  • hfile.block.index.cacheonwrite
• Evict bloom blocks!!!
  • hfile.block.bloom.cacheonwrite
• OS buffer cache to the rescue
  • Compacted data is still fresh
  • Better than going all the way back to disk
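A hedged hbase-site.xml sketch for the two cache-on-write settings above, which re-populate the BlockCache as new HFiles are written:

  <property>
    <name>hfile.block.index.cacheonwrite</name>
    <value>true</value> <!-- cache index blocks of newly written HFiles -->
  </property>
  <property>
    <name>hfile.block.bloom.cacheonwrite</name>
    <value>true</value> <!-- cache bloom filter blocks of newly written HFiles -->
  </property>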
34. Hedging our bets
• HDFS hedged reads (2.4, HDFS-5776)
  • Reads on secondary DataNodes
  • Strongly consistent
  • Works at the HDFS level
• Timeline consistency (HBASE-10070)
  • Reads on « Replica Regions »
  • Not strongly consistent
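A hedged sketch of both knobs, continuing the earlier get example; the HDFS properties are named in the comments because they live in hdfs-site.xml, and the Get API shown is the timeline-consistency interface introduced by HBASE-10070:

  // HDFS hedged reads (Hadoop 2.4) are enabled client-side, e.g. via
  //   dfs.client.hedged.read.threadpool.size  (0 disables hedging)
  //   dfs.client.hedged.read.threshold.millis (wait before issuing the hedged read)
  // Timeline-consistent read against a region replica:
  Get get = new Get(Bytes.toBytes("row-1"));
  get.setConsistency(Consistency.TIMELINE);   // allow a possibly stale answer from a secondary replica
  Result result = table.get(get);
  boolean stale = result.isStale();           // true when the answer came from a replica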
35. Read latency in summary
• Steady mode
  • Cache hit: < 1 ms
  • Cache miss: + 10 ms per seek
  • Writing while reading => cache churn
  • GC: 25-100ms pause on a regular interval
Network request + (1 - P(cache hit)) * (10 ms * seeks)
• Same long-tail issues as the write path
• Overloaded: same scheduling issues as the write path
• Partial failures hurt a lot
36. HBase ranges for 99% latency
          Put                   Streamed Multiput     Get                   Timeline get
Steady    milliseconds          milliseconds          milliseconds          milliseconds
Failure   seconds               seconds               seconds               milliseconds
GC        10's of milliseconds  milliseconds          10's of milliseconds  milliseconds
37. What's next
• Less GC
  • Use fewer objects
  • Off-heap
• Compressed BlockCache (HBASE-8894)
• Preferred location (HBASE-4755)
• The « magical 1% »
  • Most tools stop at the 99% latency
  • What happens after is much more complex
38. Thanks!
Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
HBaseCon, May 5, 2014