HBase: Where Online Meets Low Latency

HBase Low Latency
Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
HBaseCon May 5, 2014

Agenda
• Latency, what is it, how to measure it
• Write path
• Read path
• Next steps

What’s low latency
Latency is about percentiles
• Long tail issue
• There are often orders of magnitude between the « average » and the « 95th percentile »
• Post 99% = « magical 1% ». Work in progress here.
• Meaning ranges from microseconds (high-frequency trading) to seconds (interactive queries)
• In this talk: milliseconds
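
To make the long-tail point concrete, here is a minimal Python sketch. The latency sample is invented for illustration (it is not a measurement), and the nearest-rank percentile is just one common definition.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in ms: 90 fast calls, 9 slowed down (e.g. by GC),
# and one very slow call (e.g. a region split).
latencies_ms = [1.0] * 90 + [100.0] * 9 + [2000.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
worst = max(latencies_ms)
```

In this sample the median is 1 ms while the 95th percentile is 100 ms, and the single worst call sits beyond the 99th percentile where most tools stop reporting.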

Measure latency – during test
bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
• More options related to HBase: autoflush, replicas, …
• Latency measured in microseconds
• Easier for internal analysis
• YCSB
• Useful for comparison between tools
• Set of workloads already defined

Measure latency : Exposed by HBase
"QueueCallTime_num_ops" : 33044,
"QueueCallTime_min" : 0,
"QueueCallTime_max" : 86,
"QueueCallTime_mean" : 0.2525420651252875,
"QueueCallTime_median" : 0.0,
"QueueCallTime_75th_percentile" : 0.0,
...
"SyncTime_num_ops" : 379081,
"SyncTime_min" : 0,
"SyncTime_max" : 865,
"SyncTime_mean" : 3.0293341000999785,
"SyncTime_median" : 2.0,
"SyncTime_75th_percentile" : 3.0,

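A small sketch of how such metrics could be post-processed once they have been fetched as JSON. The helper name `metric_summary` is hypothetical, the values are the ones shown on the slide, and how the JSON is obtained from the RegionServer is deliberately left out.

```python
import json

# Fragment shaped like the metrics shown above (values copied from the slide).
raw = """
{
  "QueueCallTime_num_ops": 33044,
  "QueueCallTime_min": 0,
  "QueueCallTime_max": 86,
  "QueueCallTime_mean": 0.2525420651252875,
  "QueueCallTime_median": 0.0,
  "QueueCallTime_75th_percentile": 0.0,
  "SyncTime_num_ops": 379081,
  "SyncTime_min": 0,
  "SyncTime_max": 865,
  "SyncTime_mean": 3.0293341000999785,
  "SyncTime_median": 2.0,
  "SyncTime_75th_percentile": 3.0
}
"""

def metric_summary(bean, prefix):
    """Collect every '<prefix>_<stat>' entry of a metrics dict, keyed by stat name."""
    marker = prefix + "_"
    return {k[len(marker):]: v for k, v in bean.items() if k.startswith(marker)}

bean = json.loads(raw)
queue = metric_summary(bean, "QueueCallTime")
sync = metric_summary(bean, "SyncTime")

# The gap between median and max is the long tail discussed earlier.
sync_tail_ratio = sync["max"] / sync["median"]
```
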
HBase write path – high level
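[Diagram: a RegionServer (HBase) co-located with a DataNode (Hadoop DFS). The RegionServer holds the HLog (WAL), the BlockCache, and several HRegions. Each HRegion contains HStores; each HStore has a MemStore and StoreFiles (HFiles). Numbered markers 1 to 5 trace the write path.]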

Deeper in the write path
• Two parts
• Single put (WAL)
• The client just sends the put
• Multiple puts from the client (new behavior since 0.96)
• The client is much smarter
• Four stages to look at for latency
• Start (establish tcp connections, etc.)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system: you may need to add machines or tune your workload

Single put: communication
• Create a « Call » object, with an id, as queries are multiplexed
• protobuf it
• tcp write (in trunk it can be queued for a separate thread as well)
• Wait for the answer
• Separate thread, separate queue
• unprotobuf the answer
• Implies locks and multiple threads communicating with queues

Single put: server side scheduling
• Threads to receive the « Call »
• Threads to handle the call execution
• Threads to write the answer on the wire
• Multiple threads, communicating with queues

Single put: real work
• The server must
• Take a row lock (HBase strong consistency)
• Write into the WAL queue
• Write into the memstore
• Sync the queue (HDFS flush)
• Free the lock
• The WAL queue is shared between all the regions/handlers
• Sync is avoided if another handler already did the work
• You may flush more than expected
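
The shared queue and the "sync is avoided" behaviour can be sketched as a toy model. `WalQueue` and its fields are hypothetical names for illustration; this is not HBase's actual WAL implementation, and locking is omitted.

```python
class WalQueue:
    """Toy model of a WAL queue shared by all handlers of a region server."""

    def __init__(self):
        self.edits = []
        self.synced_seq = 0   # highest sequence id known to be durable
        self.sync_calls = 0   # number of (expensive) flushes actually performed

    def append(self, edit):
        """Queue an edit and return its sequence id."""
        self.edits.append(edit)
        return len(self.edits)

    def sync(self, seq):
        """Make edit `seq` durable. Returns True if this call had to flush."""
        if seq <= self.synced_seq:
            return False                    # another handler's sync already covered it
        self.synced_seq = len(self.edits)   # one flush covers every queued edit
        self.sync_calls += 1
        return True

wal = WalQueue()
a = wal.append("put row1")   # handler A
b = wal.append("put row2")   # handler B
c = wal.append("put row3")   # handler C
flushed_by_c = wal.sync(c)   # C syncs first: A's and B's edits are flushed too
flushed_by_a = wal.sync(a)   # nothing left to do
flushed_by_b = wal.sync(b)   # nothing left to do
```

Handler C ends up flushing edits it did not write, which is the "you may flush more than expected" case, while A and B return without paying for a sync.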

Latency sources
• Candidate one: network
• 0.5ms within a datacenter.
• Candidate two: HDFS Flush
• Millisecond world: everything can go wrong
• Network
• OS Scheduler
• All this goes into the post-99th percentile
Metric Time in ms
Mean 0.33
50% 0.26
95% 0.59
99% 1.24

Latency sources
• Split (and presplits)
• Autosharding is great!
• Puts have to wait
• Impacts: seconds
• Balance
• Regions move
• Triggers a retry for the client
• hbase.client.pause = 100ms since HBase 0.96
• Garbage Collection
• Impacts: tens of ms, even with a good config
• Covered in the read path part of this talk

From steady to loaded and overloaded
• Number of concurrent tasks is a factor of
• Number of cores
• Number of disks
• Number of remote machines used
• Difficult to estimate
• Queues are doomed to happen
• So for low latency
• Specific Scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
• Priorities: work in progress.

Loaded & overloaded
• Step 1: Loaded system
• Tasks are queued: creates latency
• Specific metric in HBase
• Step 2: Limit reached
• MemStore takes too much room: blocks until it’s flushed
• hbase.regionserver.global.memstore.size.lower.limit
• hbase.regionserver.global.memstore.size
• hbase.hregion.memstore.block.multiplier
• Too many HFiles: blocks until compactions catch up
• hbase.hstore.blockingStoreFiles
• Too many WAL files
• Don’t change this
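
A rough sketch of how these limits interact. The function name, the comparison operators, and every numeric value below are assumptions for illustration, not a statement of HBase's defaults or exact internal checks. `flush_size` stands for the per-region flush threshold that the block multiplier applies to.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

def blocking_reasons(region_memstore, global_memstore, store_files, conf):
    """Return the reasons (if any) why writes would be held back."""
    reasons = []
    # Per region: memstore grew past flush size x hbase.hregion.memstore.block.multiplier
    if region_memstore >= conf["flush_size"] * conf["block_multiplier"]:
        reasons.append("region memstore too large")
    # Per server: all memstores together exceed hbase.regionserver.global.memstore.size
    if global_memstore >= conf["heap"] * conf["global_memstore_size"]:
        reasons.append("global memstore limit reached")
    # Per store: more files than hbase.hstore.blockingStoreFiles
    if store_files > conf["blocking_store_files"]:
        reasons.append("too many store files")
    return reasons

# Example values only.
conf = {
    "heap": 8 * GIB,
    "flush_size": 128 * MIB,
    "block_multiplier": 2,
    "global_memstore_size": 0.4,
    "blocking_store_files": 10,
}

healthy = blocking_reasons(64 * MIB, 1 * GIB, 4, conf)
hot_region = blocking_reasons(300 * MIB, 1 * GIB, 4, conf)
behind_on_compaction = blocking_reasons(64 * MIB, 1 * GIB, 12, conf)
```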

Machine failure
• Failure
• Detect
• Reallocate
• Replay WAL
• Replaying WAL is NOT required for puts
• Failure = Detect + Reallocate + Retry
• That’s in the range of ~1s for simple failures
• Silent failures put you in the 10s range if the hardware does not help

Single puts
• Millisecond range
• Spikes do happen in steady mode
• 100ms
• Causes: GC, load, splits

Streaming puts
HTable#setAutoFlushTo(false)
HTable#put
HTable#flushCommits

Streaming puts
• Write into a buffer
• When the buffer is full, in the background
• Select the puts that match load conditions
• Send them
• Manage retries and delay
• The buffer is freed for other client operations
• Blocks only if there is a non-retryable error or if the buffer is full
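
The buffering idea can be sketched with a toy client-side buffer. `BufferedWriter` and `send_batch` are hypothetical names; unlike the real client, this sketch sends synchronously and has no retry handling.

```python
class BufferedWriter:
    """Toy write buffer: puts accumulate and leave in batches."""

    def __init__(self, send_batch, max_buffer_bytes):
        self.send_batch = send_batch            # callable taking a list of (row, value)
        self.max_buffer_bytes = max_buffer_bytes
        self.buffer = []
        self.buffered_bytes = 0

    def put(self, row, value):
        self.buffer.append((row, value))
        self.buffered_bytes += len(row) + len(value)
        if self.buffered_bytes >= self.max_buffer_bytes:
            self.flush()                        # buffer full: ship what we have

    def flush(self):
        if self.buffer:
            batch, self.buffer, self.buffered_bytes = self.buffer, [], 0
            self.send_batch(batch)

sent_batches = []
writer = BufferedWriter(sent_batches.append, max_buffer_bytes=16)
for i in range(5):
    writer.put(b"row%d" % i, b"v%d" % i)   # 4 + 2 = 6 bytes per put
writer.flush()                              # explicit flush of the remainder
```

Five puts leave as two batches instead of five round trips; the caller only pays for a network exchange when the buffer fills or when it flushes explicitly.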

Multiple puts
• hbase.client.max.total.tasks (default 100)
• hbase.client.max.perserver.tasks (default 5)
• hbase.client.max.perregion.tasks (default 1)
• Decouple the client from a latency peak of a region server
• Increase the throughput by 50%
• Does not solve the problem of an unbalanced cluster
• But makes split and GC more transparent
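
A minimal sketch of how per-server and per-region task limits decouple the client from one slow server. The function and the server/region names are hypothetical; only the two limit values come from the slide, and the real client logic is more involved.

```python
def split_by_load(puts, in_flight_per_server, in_flight_per_region,
                  max_per_server=5, max_per_region=1):
    """puts: list of (server, region, row). Returns (send_now, keep_buffered)."""
    send_now, keep_buffered = [], []
    for server, region, row in puts:
        server_busy = in_flight_per_server.get(server, 0) >= max_per_server
        region_busy = in_flight_per_region.get(region, 0) >= max_per_region
        target = keep_buffered if (server_busy or region_busy) else send_now
        target.append((server, region, row))
    return send_now, keep_buffered

puts = [
    ("rs1", "regionA", "row1"),
    ("rs1", "regionB", "row2"),
    ("rs2", "regionC", "row3"),
]
# regionA already has one request in flight (for example, it is mid-split or in a GC pause).
send_now, keep_buffered = split_by_load(
    puts,
    in_flight_per_server={"rs1": 1},
    in_flight_per_region={"regionA": 1},
)
```

The put for the busy region stays in the buffer while the others go out, so a latency peak on one region does not stall the rest of the stream.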

Conclusion on write path
• Single puts can be very fast
• It’s not a « hard real time » system: there are peaks
• Latency peaks can be hidden when streaming puts
• Including autosplits

HBase read path – high level
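[Diagram: a RegionServer (HBase) co-located with a DataNode (Hadoop DFS). The RegionServer holds the HLog (WAL), the BlockCache, and several HRegions. Each HRegion contains HStores; each HStore has a MemStore and StoreFiles (HFiles). Numbered markers 1 to 5 trace the read path.]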

Deeper in the read path
• Gets and short scans are assumed for low-latency operations
• Again, two APIs
• Single get: HTable#get(Get)
• Multi-get: HTable#get(List<Get>)
• Four stages, same as write path
• Start (tcp connection, …)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system: you may need to add machines or tune your workload

Multi get / Client
Group Gets by
RegionServer

Multi get / Client
Execute them
one by one
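
Grouping Gets by RegionServer can be sketched as follows. The region map (start keys and server names) is invented for illustration; a real client would look regions up in its cached region locations.

```python
import bisect
from collections import defaultdict

# Hypothetical region map: sorted region start keys and the server hosting each region.
region_start_keys = ["", "g", "p"]
region_servers = ["rs1", "rs2", "rs1"]

def locate(row):
    """Find the server of the region whose start key is the greatest one <= row."""
    idx = bisect.bisect_right(region_start_keys, row) - 1
    return region_servers[idx]

def group_by_server(rows):
    """Bucket row keys by hosting server so each server gets a single request."""
    groups = defaultdict(list)
    for row in rows:
        groups[locate(row)].append(row)
    return dict(groups)

groups = group_by_server(["apple", "kiwi", "zebra", "grape"])
```

Four Gets become two requests, one per server.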

Access latency magnitudes
Storage hierarchy: a different view (Dean, 2009)
Memory is 100000x faster than disk!
Disk seek = 10ms

Known unknowns
• For each candidate HFile
• Exclude by file metadata
• Timestamp
• Rowkey range
• Exclude by bloom filter
• StoreFileManager (0.96, HBASE-7678)
StoreFileScanner#shouldUseScanner()
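
The exclusion steps above can be sketched as a filter over candidate files. `HFileInfo` and `may_contain` are hypothetical names, and a Python set stands in for the bloom filter (a real bloom filter can return false positives, never false negatives).

```python
from dataclasses import dataclass, field

@dataclass
class HFileInfo:
    name: str
    first_row: str
    last_row: str
    min_ts: int
    max_ts: int
    bloom: set = field(default_factory=set)   # stand-in for a real bloom filter

def may_contain(hfile, row, ts_min, ts_max):
    """Cheap checks that let a read skip an HFile without touching its data blocks."""
    if row < hfile.first_row or row > hfile.last_row:
        return False                 # excluded by rowkey range
    if hfile.max_ts < ts_min or hfile.min_ts > ts_max:
        return False                 # excluded by timestamp range
    return row in hfile.bloom        # excluded by bloom filter if absent

hfiles = [
    HFileInfo("f1", "a", "m", 100, 200, {"apple", "kiwi"}),
    HFileInfo("f2", "n", "z", 100, 200, {"pear"}),
    HFileInfo("f3", "a", "z", 10, 50, {"kiwi"}),
    HFileInfo("f4", "a", "z", 100, 300, {"lemon"}),
]

candidates = [f.name for f in hfiles if may_contain(f, "kiwi", ts_min=100, ts_max=400)]
```

Only one of the four files has to be opened for this read; each skipped file is a disk seek avoided.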

Unknown knowns
• Merge sort results polled from Stores
• Seek each scanner to a reference KeyValue
• Retrieve candidate data from disk
• Multiple HFiles => multiple seeks
• hbase.storescanner.parallel.seek.enable=true
• Short Circuit Reads
• dfs.client.read.shortcircuit=true
• Block locality
• Happy clusters compact!
HFileBlock#readBlockData()
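
The merge of sorted results from several stores can be sketched with a heap-based merge. The tuples below are a simplified stand-in for KeyValues: (row, negated timestamp, value), so that within a row the newest version sorts first. The data is invented.

```python
import heapq

# Each source is already sorted, as a MemStore or an HFile scanner would be.
memstore = [("row1", -30, "c"), ("row3", -25, "x")]
hfile_1 = [("row1", -20, "b"), ("row2", -15, "m")]
hfile_2 = [("row1", -10, "a"), ("row3", -5, "w")]

# Lazily merge the sorted sources into one sorted stream.
merged = list(heapq.merge(memstore, hfile_1, hfile_2))

# Keep the first (newest) version seen for each row.
latest = {}
for row, neg_ts, value in merged:
    latest.setdefault(row, value)
```

Every extra source adds one more scanner to seek and poll, which is why fewer HFiles means fewer seeks.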

Remembered knowns: BlockCache
• Reuse previously read data
• Smaller BLOCKSIZE => better utilization
• TODO: compression (HBASE-8894)
BlockCache#getBlock()
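
A minimal LRU cache sketch showing the reuse idea. `TinyLruCache` is a hypothetical toy: it counts capacity in blocks rather than bytes and has none of the priority handling of HBase's LruBlockCache.

```python
from collections import OrderedDict

class TinyLruCache:
    """Toy block cache with least-recently-used eviction."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_block(self, key, load_from_disk):
        if key in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(key)      # mark as most recently used
            return self.blocks[key]
        self.misses += 1
        block = load_from_disk(key)           # the expensive path (disk seek)
        self.blocks[key] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used block
        return block

cache = TinyLruCache(capacity_blocks=2)
load = lambda key: "data for " + key
for key in ["b1", "b2", "b1", "b3", "b2"]:
    cache.get_block(key, load)
```

With room for only two blocks, the access pattern above gets one hit out of five reads; smaller blocks let more distinct hot entries fit in the same memory.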

BlockCache Showdown
• LruBlockCache
• Quite good most of the time
• < 30 GB
• BucketCache
• Offheap alternative
• > 30 GB
http://www.n10k.com/blog/blockcache-showdown/

Latency enemies: Compactions
• Fewer HFiles => fewer seeks
• Evict data blocks!
• Evict Index blocks!!
• hfile.block.index.cacheonwrite
• Evict bloom blocks!!!
• hfile.block.bloom.cacheonwrite
• OS buffer cache to the rescue
• Compacted data is still fresh
• Better than going all the way back to disk

Latency enemies: Garbage Collection
• Use Heap. Not too much. With CMS.
• Max heap: 30GB, probably less
• Healthy cluster load
• regular, reliable collections
• 25-100ms pause on regular interval
• Overloaded RegionServer suffers GC overmuch

Off-heap to the rescue?
• BucketCache (0.96, HBASE-7404)
• Network interfaces (HBASE-9535)
• MemStore et al (HBASE-10191)

Failure
• Machine failure
• Detect + Reallocate + Replay
• Strong consistency requires replay
• Cache starts from scratch

Read latency in summary
• Steady mode
• Cache hit: < 1 ms
• Cache miss: + 10 ms per seek
• Writing while reading: cache churn
• GC: 25-100ms pause on regular interval
Network request + (1 - P(cache hit)) * 10 ms
• Same long tail issues as write
• Overloaded: same scheduling issues as write
• Partial failures hurt a lot
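
The steady-mode formula above (network request + (1 - P(cache hit)) * 10 ms) can be evaluated for a few cache hit ratios. The 0.5 ms network figure is the in-datacenter value quoted earlier in the talk; the hit ratios are example inputs.

```python
def expected_read_latency_ms(network_ms, cache_hit_ratio, seek_ms=10.0):
    """Expected latency: network request + (1 - P(cache hit)) * seek cost."""
    return network_ms + (1.0 - cache_hit_ratio) * seek_ms

for ratio in (0.99, 0.90, 0.50):
    latency = expected_read_latency_ms(network_ms=0.5, cache_hit_ratio=ratio)
    print(f"hit ratio {ratio:.2f}: {latency:.2f} ms")
```

This is an average: an individual read either hits the cache or pays the full seek, so the tail is driven by the misses.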

Hedging our bets
• HDFS Hedged reads (since HDFS 2.4)
• Strongly consistent
• Works at the HDFS level
• Timeline consistency (HBASE-10070)
• Reads on secondary regions
• If a region does not answer quickly enough, go
to another one
• Not strongly consistent
• Helps read-path latency a lot.
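
The hedging pattern itself ("if the first replica does not answer quickly enough, ask another one") can be sketched generically. This illustrates the idea only; it is not the HDFS or HBase implementation, and the replica functions and timings are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def hedged_read(primary, backup, hedge_after_s, pool):
    """Ask the primary; if it has not answered after hedge_after_s, also ask the backup."""
    first = pool.submit(primary)
    done, _ = wait([first], timeout=hedge_after_s)
    if done:
        return first.result()
    second = pool.submit(backup)
    done, _ = wait([first, second], return_when=FIRST_COMPLETED)
    return next(iter(done)).result()   # whichever replica answered first

def slow_replica():
    time.sleep(0.5)                    # e.g. a node stuck in a GC pause
    return "value from slow replica"

def fast_replica():
    return "value from fast replica"

pool = ThreadPoolExecutor(max_workers=4)
hedged = hedged_read(slow_replica, fast_replica, hedge_after_s=0.05, pool=pool)
direct = hedged_read(fast_replica, slow_replica, hedge_after_s=0.05, pool=pool)
pool.shutdown(wait=False)
```

The cost is extra load (the slow request is not cancelled here), and when the two replicas may hold different versions of the data, as with secondary regions, the answer is no longer strongly consistent.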

HBase ranges for 99% latency
          Put                   Streamed Multiput   Get                   Timeline get
Steady    milliseconds          milliseconds        milliseconds          milliseconds
Failure   seconds               seconds             seconds               milliseconds
GC        10s of milliseconds   milliseconds        10s of milliseconds   milliseconds

What’s next
• Less GC
• Use fewer objects
• Offheap
• Preferred location (HBASE-4755)
• The « magical 1% »
• Most tools stop at the 99% latency
• YCSB for example
• What happens after is much more complex
• But key to improving the average

Thanks!
