What’s low latency
• Latency is about percentiles
  • Average != 50% percentile
  • There are often orders of magnitude between « average » and « 95 percentile »
  • Post 99% = « magical 1% ». Work in progress here.
• Meaning from microseconds (High Frequency Trading) to seconds (interactive queries)
  • In this talk: milliseconds
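A hedged illustration of why the average and the 50th percentile diverge: a couple of slow outliers pulls the mean up while the median barely moves. The numbers below are made up for the example.

  import java.util.Arrays;

  public class PercentileExample {
    public static void main(String[] args) {
      // 98 fast requests at 1 ms and 2 slow ones at 200 ms (made-up latencies)
      double[] latenciesMs = new double[100];
      Arrays.fill(latenciesMs, 1.0);
      latenciesMs[98] = 200.0;
      latenciesMs[99] = 200.0;

      Arrays.sort(latenciesMs);
      double mean = Arrays.stream(latenciesMs).average().getAsDouble();
      double p50 = latenciesMs[49];  // 50th percentile, nearest-rank
      double p99 = latenciesMs[98];  // 99th percentile, nearest-rank

      // Prints mean=4.98 ms, p50=1.0 ms, p99=200.0 ms: the average hides the tail
      System.out.printf("mean=%.2f ms, p50=%.1f ms, p99=%.1f ms%n", mean, p50, p99);
    }
  }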
Measure latency
bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
• More options related to HBase: autoflush, replicas, …
• Latency measured in microseconds
• Easier for internal analysis
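For example, a hedged invocation that runs the write test from a single client process; the exact flags vary between HBase versions, so check the tool's usage output on your build:

  bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --rows=100000 randomWrite 10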
YCSB - Yahoo! Cloud Serving Benchmark
• Useful for comparison between databases
• A set of workloads is already defined
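A hedged example of driving YCSB against HBase; the binding name (here hbase10) and the table/column-family properties depend on your YCSB and HBase versions:

  bin/ycsb load hbase10 -P workloads/workloada -p table=usertable -p columnfamily=family
  bin/ycsb run hbase10 -P workloads/workloada -p table=usertable -p columnfamily=family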
Write path
• Two parts
  • Single put (WAL)
    • The client just sends the put
  • Multiple puts from the client (new behavior since 0.96)
    • The client is much smarter
• Four stages to look at for latency
  • Start (establish tcp connections, etc.)
  • Steady: when expected conditions are met
  • Machine failure: expected as well
  • Overloaded system
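A minimal single-put sketch against the 0.96-era client API; the table, family, and qualifier names are placeholders for the example:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SinglePutExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "t1");  // 't1', 'f', 'q' are placeholder names
      try {
        Put put = new Put(Bytes.toBytes("row-0001"));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        table.put(put);  // one RPC: the server appends to the WAL, syncs, writes the memstore
      } finally {
        table.close();
      }
    }
  }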
Single put: communication & scheduling
• Client: TCP connection to the server
  • Shared: multiple threads on the same client use the same TCP connection
  • Pooling is possible and does improve performance in some circumstances (see the sketch after this list)
    • hbase.client.ipc.pool.size
• Server: multiple calls from multiple threads on multiple machines
  • Can become thousands of simultaneous queries
  • Scheduling is required
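An illustrative client-side hbase-site.xml fragment for the pooling option above; the pool size is an example value, not a recommendation:

  <property>
    <name>hbase.client.ipc.pool.size</name>
    <value>5</value>
  </property>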
Single put: real work
• The server must
  • Write into the WAL queue
  • Sync the WAL queue (HDFS flush)
  • Write into the memstore
• The WAL queue is shared between all the regions/handlers
  • Sync is avoided if another handler did the work
  • You may flush more than expected
Simple put: A small run

  Percentile   Time in ms
  Mean         1.21
  50%          0.95
  95%          1.50
  99%          2.12
Latency sources
• Candidate one: network
  • 0.5ms within a datacenter
  • Much less between nodes in the same rack

  Percentile   Time in ms
  Mean         0.13
  50%          0.12
  95%          0.15
  99%          0.47
Latency sources
• Candidate two: HDFS Flush
• We can still do better: HADOOP-7714 & sons.

  Percentile   Time in ms
  Mean         0.33
  50%          0.26
  95%          0.59
  99%          1.24
Latency sources
• Millisecond world: everything can go wrong
  • JVM
  • Network
  • OS Scheduler
  • File System
• All this goes into the post-99% percentile
• Requires monitoring
• Usually, using the latest version helps.
Latency sources
• Split (and presplits)
  • Autosharding is great!
  • Puts have to wait
  • Impacts: seconds
  • Pre-splitting avoids this; see the sketch after this list
• Balance
  • Regions move
  • Triggers a retry for the client
  • hbase.client.pause = 100ms since HBase 0.96
• Garbage Collection
  • Impacts: 10’s of ms, even with a good config
  • Covered in the read-path part of this talk
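A minimal pre-split sketch with the 0.96-era admin API, so that a new table starts with several regions instead of splitting under load; the table name, family, and split points are illustrative:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PresplitExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      try {
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t1"));
        desc.addFamily(new HColumnDescriptor("f"));
        // Pre-split into 4 regions so early writes do not wait on splits
        byte[][] splits = { Bytes.toBytes("4"), Bytes.toBytes("8"), Bytes.toBytes("c") };
        admin.createTable(desc, splits);
      } finally {
        admin.close();
      }
    }
  }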
From steady to loaded and overloaded
• The number of concurrent tasks depends on
  • the number of cores
  • the number of disks
  • the number of remote machines used
• Difficult to estimate
• Queues are doomed to happen
  • hbase.regionserver.handler.count
• So, for low latency
  • Replaceable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
  • RPC priorities: work in progress (HBASE-11048)
From loaded to overloaded
• MemStore takes too much room: flush, then block quite quickly
  • hbase.regionserver.global.memstore.size.lower.limit
  • hbase.regionserver.global.memstore.size
  • hbase.hregion.memstore.block.multiplier
• Too many HFiles: block until compactions keep up
  • hbase.hstore.blockingStoreFiles
• Too many WAL files: flush and block
  • hbase.regionserver.maxlogs
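For reference, these knobs live in the RegionServers' hbase-site.xml; the values below are illustrative only, not recommendations:

  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>10</value>
  </property>
  <property>
    <name>hbase.hregion.memstore.block.multiplier</name>
    <value>4</value>
  </property>
  <property>
    <name>hbase.regionserver.maxlogs</name>
    <value>32</value>
  </property>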
Machine failure
• Failure
  • Detect
  • Reallocate
  • Replay the WAL
• Replaying the WAL is NOT required for puts
  • hbase.master.distributed.log.replay
  • (default true in 1.0)
• Failure = Detect + Reallocate + Retry
  • That’s in the range of ~1s for simple failures
  • Silent failures put you in the 10s range if the hardware does not help
  • zookeeper.session.timeout
Multiple puts
hbase.client.max.total.tasks (default 100)
hbase.client.max.perserver.tasks (default 5)
hbase.client.max.perregion.tasks (default 1)
• Decouple the client from a latency spike of a region server
• Increase the throughput by 50% compared to the old multiput
• Makes splits and GC more transparent
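A minimal streaming-puts sketch with the 0.96-era client: the Puts are handed over as a list and the client groups and dispatches them per RegionServer under the task limits above. Names are placeholders.

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class MultiPutExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "t1");  // 't1', 'f', 'q' are placeholder names
      try {
        List<Put> batch = new ArrayList<Put>();
        for (int i = 0; i < 10000; i++) {
          Put put = new Put(Bytes.toBytes(String.format("row-%08d", i)));
          put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v" + i));
          batch.add(put);
        }
        table.put(batch);  // grouped by region server, sent in parallel
      } finally {
        table.close();
      }
    }
  }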
Conclusion on write path
• Single puts can be very fast
  • It’s not a « hard real time » system: there are spikes
• Most latency spikes can be hidden when streaming puts
• Failures are NOT that difficult for the write path
  • No WAL to replay
Read path
• Get / short scan are assumed for low-latency operations
• Again, two APIs
  • Single get: HTable#get(Get)
  • Multi-get: HTable#get(List<Get>)
• Four stages, same as the write path
  • Start (tcp connection, …)
  • Steady: when expected conditions are met
  • Machine failure: expected as well
  • Overloaded system: you may need to add machines or tune your workload
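A minimal sketch of both read APIs on the 0.96-era client; the table and row names are placeholders:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class GetExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "t1");  // placeholder table name
      try {
        // Single get: one RPC to one RegionServer
        Result single = table.get(new Get(Bytes.toBytes("row-0001")));
        System.out.println(single);

        // Multi-get: the client groups the Gets by RegionServer
        List<Get> gets = new ArrayList<Get>();
        for (int i = 0; i < 100; i++) {
          gets.add(new Get(Bytes.toBytes(String.format("row-%04d", i))));
        }
        Result[] results = table.get(gets);
        System.out.println(results.length + " results");
      } finally {
        table.close();
      }
    }
  }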
Multi get / Client
• Group Gets by RegionServer
• Execute them one by one
Known unknowns
• For each candidate HFile
  • Exclude by file metadata
    • Timestamp
    • Rowkey range
  • Exclude by bloom filter

StoreFileScanner#shouldUseScanner()
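A minimal sketch of the client-side settings that make these exclusions possible: a row bloom filter on the family, and a bounded time range on the Get. The table and family names are placeholders.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.regionserver.BloomType;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ExclusionHintsExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();

      // Row-level bloom filter: lets a get skip HFiles that cannot contain the row
      HBaseAdmin admin = new HBaseAdmin(conf);
      try {
        HColumnDescriptor family = new HColumnDescriptor("f");
        family.setBloomFilterType(BloomType.ROW);
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t1"));
        desc.addFamily(family);
        admin.createTable(desc);
      } finally {
        admin.close();
      }

      // A bounded time range lets the scanner exclude HFiles by their timestamp metadata
      Get get = new Get(Bytes.toBytes("row-0001"));
      get.setTimeRange(0L, System.currentTimeMillis());
    }
  }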
Unknown knowns
• Merge sort results polled from Stores
  • Seek each scanner to a reference KeyValue
  • Retrieve candidate data from disk
• Multiple HFiles => multiple seeks
  • hbase.storescanner.parallel.seek.enable=true
• Short Circuit Reads
  • dfs.client.read.shortcircuit=true
• Block locality
  • Happy clusters compact!

HFileBlock#readBlockData()
BlockCache Showdown
• LruBlockCache
  • Default, on-heap
  • Quite good most of the time
  • Evictions impact GC
• BucketCache
  • Off-heap alternative
  • Serialization overhead
  • Large memory configurations

http://www.n10k.com/blog/blockcache-showdown/
L2 off-heap BucketCache makes a strong showing
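An illustrative hbase-site.xml fragment for putting the BucketCache off-heap; the cache size is an example only, and the JVM's direct memory limit has to be raised accordingly:

  <property>
    <name>hbase.bucketcache.ioengine</name>
    <value>offheap</value>
  </property>
  <property>
    <name>hbase.bucketcache.size</name>
    <value>8192</value>
  </property>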
Latency enemies: Garbage Collection
• Use heap. Not too much. With CMS.
• Max heap
  • 30GB (compressed pointers)
  • 8-16GB if you care about the 9’s
• Healthy cluster load
  • regular, reliable collections
  • 25-100ms pause on regular interval
• Overloaded RegionServer suffers GC overmuch
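A hedged hbase-env.sh sketch of the kind of settings this implies; the heap size and CMS threshold are illustrative and need tuning per workload:

  # hbase-env.sh (illustrative values)
  export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
    -Xms16g -Xmx16g \
    -XX:+UseConcMarkSweepGC \
    -XX:CMSInitiatingOccupancyFraction=70 \
    -XX:+UseCMSInitiatingOccupancyOnly"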
Off-heap to the rescue?
• BucketCache (0.96, HBASE-7404)
• Network interfaces (HBASE-9535)
• MemStore et al (HBASE-10191)
Latency enemies: Compactions
• Fewer HFiles => fewer seeks
• Evict data blocks!
• Evict index blocks!!
  • hfile.block.index.cacheonwrite
• Evict bloom blocks!!!
  • hfile.block.bloom.cacheonwrite
• OS buffer cache to the rescue
  • Compacted data is still fresh
  • Better than going all the way back to disk
Hedging our bets
• HDFS Hedged reads (2.4, HDFS-5776)
  • Reads on secondary DataNodes
  • Strongly consistent
  • Works at the HDFS level
• Timeline consistency (HBASE-10070)
  • Reads on « Replica Regions »
  • Not strongly consistent
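A minimal timeline-consistency sketch with the HBase 1.0+ client API that HBASE-10070 introduces; it assumes the table was created with region replication enabled, and the table name is a placeholder:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Consistency;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TimelineGetExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection connection = ConnectionFactory.createConnection(conf);
           Table table = connection.getTable(TableName.valueOf("t1"))) {
        Get get = new Get(Bytes.toBytes("row-0001"));
        get.setConsistency(Consistency.TIMELINE);  // allow the read to go to a replica region
        Result result = table.get(get);
        System.out.println("stale=" + result.isStale());  // true if a replica answered
      }
    }
  }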
Read latency in summary
• Steady mode
  • Cache hit: < 1 ms
  • Cache miss: + 10 ms per seek
  • Writing while reading => cache churn
  • GC: 25-100ms pause on regular interval

  Network request + (1 - P(cache hit)) * (10 ms * seeks)

• Same long tail issues as the write path
• Overloaded: same scheduling issues as the write path
• Partial failures hurt a lot
HBase ranges for 99% latency

            Put                   Streamed Multiput      Get                    Timeline get
  Steady    milliseconds          milliseconds           milliseconds           milliseconds
  Failure   seconds               seconds                seconds                milliseconds
  GC        10’s of milliseconds  milliseconds           10’s of milliseconds   milliseconds
What’s next
• Less GC
  • Use fewer objects
  • Off-heap
• Compressed BlockCache (HBASE-8894)
• Preferred location (HBASE-4755)
• The « magical 1% »
  • Most tools stop at the 99% latency
  • What happens after is much more complex