We start by looking at distributed database features that impact latency. Then we take a deeper look at the HBase read and write paths with a focus on request latency. We examine the sources of latency and how to minimize them.
1. HBase Low Latency
Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
Strata New York, October 17, 2014
2. Agenda
• Latency, what is it, how to measure it
• Write path
• Read path
• Next steps
3. What’s low latency
• Meaning ranges from microseconds (high-frequency trading) to seconds (interactive queries)
• In this talk: milliseconds
Latency is about percentiles
• Average != 50th percentile
• There are often orders of magnitude between the « average » and the « 95th percentile »
• Post-99% = the « magical 1% ». Work in progress here.
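The gap between mean and percentiles under a long tail is easy to demonstrate. Here is a minimal, self-contained sketch (not HBase code) using a nearest-rank percentile: two GC-like outliers are enough to pull the mean an order of magnitude above the median.

```java
import java.util.Arrays;

// Toy illustration: with a long tail, the mean sits far above the median,
// so "average latency" hides what most requests actually experience.
public class LatencyPercentiles {
    // Nearest-rank percentile over a sorted copy of the samples.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    static double mean(double[] samples) {
        double sum = 0;
        for (double s : samples) sum += s;
        return sum / samples.length;
    }

    public static void main(String[] args) {
        // 98 fast requests at 1 ms, two GC-like outliers at 1000 ms.
        double[] ms = new double[100];
        Arrays.fill(ms, 1.0);
        ms[98] = 1000.0;
        ms[99] = 1000.0;
        System.out.printf("mean=%.2f p50=%.2f p99=%.2f%n",
                mean(ms), percentile(ms, 50), percentile(ms, 99));
        // prints "mean=20.98 p50=1.00 p99=1000.00"
    }
}
```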
4. Measure latency
YCSB - Yahoo! Cloud Serving Benchmark
• Useful for comparison between databases
• Set of workloads already defined
bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
• More options related to HBase: autoflush, replicas, …
• Latency measured in microseconds
• Easier for internal analysis
5. Why is it important
Durability
Availability
Consistency
6. Durability
[Diagram: durability levels 0–3 on the write path — client buffer, server buffer, OS buffer, disk. HBase/BigTable (as on GFS) and a traditional DB engine sync at different levels of this hierarchy.]
8. Consistency
• Two processes: P1, P2
• A counter is updated by P1: v1, then v2, then v3
• Eventual consistency allows P1 and P2 to see these updates in any order
• Strong consistency allows only one order
Google F1 paper, VLDB (2013)
We store financial data and have hard requirements
on data integrity and consistency. We also have a lot
of experience with eventual consistency systems at
Google. In all such systems, we find developers spend
a significant fraction of their time building extremely
complex and error-prone mechanisms to cope with
eventual consistency and handle data that may be
out of date. We think this is an unacceptable burden
to place on developers and that consistency problems
should be solved at the database level.
9. Consistency
BigTable design allows consistency by partitioning the data: each
machine serves a subset of the data.
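A toy sketch of that partitioning idea (names hypothetical, not HBase's internals): each region is a row-key interval owned by exactly one server, so every read and write for a row has a single home. The owning server is found from the greatest region start key at or below the row key.

```java
import java.util.TreeMap;

// Toy sketch of range partitioning: each region (start-key interval) is
// served by exactly one server, so a row has one writer and one reader path.
public class RangePartition {
    private final TreeMap<String, String> regionToServer = new TreeMap<>();

    void assign(String startKey, String server) {
        regionToServer.put(startKey, server);
    }

    // The owning server is the one for the greatest start key <= rowKey.
    String serverFor(String rowKey) {
        return regionToServer.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        RangePartition p = new RangePartition();
        p.assign("", "rs1");   // keys in ["", "m") -> rs1
        p.assign("m", "rs2");  // keys in ["m", ...) -> rs2
        System.out.println(p.serverFor("apple")); // prints "rs1"
        System.out.println(p.serverFor("zebra")); // prints "rs2"
    }
}
```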
10. Availability
• The contract is: « a client outside the cluster will see HBase as available even if
there are partitions or failures within the HBase cluster »
• There is a lot more to say, but it’s outside the scope of this talk
(unfortunately)
12. Trade off
• Maximizing the benefits while minimizing
the cost
• Implementation details count
• Configuration counts
13. Write path
• Two parts
• Single put (WAL)
• The client just sends the put
• Multiple puts from the client (new behavior since 0.96)
• The client is much smarter
• Four stages to look at for latency
• Start (establish TCP connections, etc.)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system
14. Single put: communication & scheduling
• Client: TCP connection to the server
• Shared: multiple threads on the same client use the same TCP connection
• Pooling is possible and does improve performance in some circumstances
• hbase.client.ipc.pool.size
• Server: multiple calls from multiple threads on multiple machines
• Can become thousands of simultaneous queries
• Scheduling is required
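As a rough illustration of what hbase.client.ipc.pool.size enables, here is a toy round-robin pool. This is purely illustrative (names hypothetical), not HBase's RPC code: the point is only that threads spread over several connections instead of queueing on one.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy sketch: instead of all client threads sharing one TCP connection,
// a small pool of connections is picked round-robin.
public class ConnectionPool {
    private final String[] connections;
    private final AtomicLong next = new AtomicLong();

    ConnectionPool(int size) {
        connections = new String[size];
        for (int i = 0; i < size; i++) connections[i] = "conn-" + i;
    }

    // Each call rotates to the next connection in the pool.
    String pick() {
        int i = (int) (next.getAndIncrement() % connections.length);
        return connections[i];
    }

    public static void main(String[] args) {
        ConnectionPool pool = new ConnectionPool(3);
        for (int i = 0; i < 5; i++) System.out.println(pool.pick());
        // prints conn-0, conn-1, conn-2, conn-0, conn-1
    }
}
```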
15. Single put: real work
• The server must
• Write into the WAL queue
• Sync the WAL queue (HDFS flush)
• Write into the memstore
• The WAL queue is shared between all the regions/handlers
• A sync is skipped if another handler already did the work
• You may flush more than expected
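The sync-skipping above is a form of group commit. A minimal, single-threaded sketch of the idea (not the real WAL code): a handler only performs a sync if nobody has already synced past its sequence id, so one HDFS flush can make many handlers' edits durable at once.

```java
// Toy sketch of group commit on a shared WAL: a handler issues a sync only
// if no other handler has already synced past its sequence id.
public class GroupCommit {
    private long highestSynced = 0;
    private int syncsIssued = 0;

    // Returns true if this call actually had to perform a sync.
    synchronized boolean syncUpTo(long seqId) {
        if (seqId <= highestSynced) {
            return false;          // another handler already did the work
        }
        highestSynced = seqId;     // one sync covers all pending edits
        syncsIssued++;
        return true;
    }

    synchronized int syncsIssued() { return syncsIssued; }

    public static void main(String[] args) {
        GroupCommit wal = new GroupCommit();
        wal.syncUpTo(5);                       // syncs edits up to seq 5
        System.out.println(wal.syncUpTo(3));   // prints "false": already durable
        System.out.println(wal.syncsIssued()); // prints "1"
    }
}
```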
16. Single put: a small run
Percentile Time in ms
Mean 1.21
50% 0.95
95% 1.50
99% 2.12
17. Latency sources
• Candidate one: network
• 0.5ms within a datacenter
• Much less between nodes in the same rack
Percentile Time in ms
Mean 0.13
50% 0.12
95% 0.15
99% 0.47
18. Latency sources
• Candidate two: HDFS Flush
Percentile Time in ms
Mean 0.33
50% 0.26
95% 0.59
99% 1.24
• We can still do better: HADOOP-7714 & sons.
19. Latency sources
• Millisecond world: everything can go wrong
• JVM
• Network
• OS Scheduler
• File System
• All of this goes into the post-99% percentile
• Requires monitoring
• Usually, using the latest version helps
20. Latency sources
• Split (and presplits)
• Autosharding is great!
• Puts have to wait
• Impacts: seconds
• Balance
• Regions move
• Triggers a retry for the client
• hbase.client.pause = 100ms since HBase 0.96
• Garbage Collection
• Impacts: 10’s of ms, even with a good config
• Covered in the read path part of this talk
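The client retry pacing after a region move can be sketched numerically: the sleep is hbase.client.pause multiplied by a growing backoff factor. The multiplier table below mirrors the one HBase ships (hedged: check HConstants.RETRY_BACKOFF for your version).

```java
// Sketch of client retry pacing: sleep = hbase.client.pause * backoff factor,
// where the factor grows with the attempt number and caps at the last entry.
public class RetryPause {
    // Assumed to mirror HConstants.RETRY_BACKOFF; verify for your HBase version.
    static final int[] BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    static long pauseMs(long basePauseMs, int attempt) {
        int i = Math.min(attempt, BACKOFF.length - 1);
        return basePauseMs * BACKOFF[i];
    }

    public static void main(String[] args) {
        long base = 100; // hbase.client.pause default since 0.96
        for (int attempt = 0; attempt < 4; attempt++) {
            System.out.println("attempt " + attempt + ": sleep "
                    + pauseMs(base, attempt) + " ms");
        }
        // prints 100, 200, 300, 500 ms for attempts 0..3
    }
}
```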
21. From steady to loaded and overloaded
• The number of concurrent tasks is a function of
• Number of cores
• Number of disks
• Number of remote machines used
• Difficult to estimate
• Queues are doomed to happen
• hbase.regionserver.handler.count
• So for low latency
• Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
• RPC Priorities: since 0.98 (HBASE-11048)
22. From loaded to overloaded
• MemStore takes too much room: flush, then blocks quite quickly
• hbase.regionserver.global.memstore.size.lower.limit
• hbase.regionserver.global.memstore.size
• hbase.hregion.memstore.block.multiplier
• Too many HFiles: block until compactions keep up
• hbase.hstore.blockingStoreFiles
• Too many WAL files: flush and block
• hbase.regionserver.maxlogs
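Back-of-the-envelope arithmetic for the first two thresholds above (the sizes below are illustrative, chosen near common defaults): a region blocks puts once its memstore exceeds the flush size times the block multiplier, and the server as a whole blocks once global memstore usage crosses the global fraction of the heap.

```java
// Toy arithmetic for the write-blocking thresholds (values illustrative).
public class BlockingThresholds {
    // Region-level: memstore > flushSize * hbase.hregion.memstore.block.multiplier
    static boolean regionBlocksWrites(long memstoreBytes, long flushSizeBytes,
                                      int blockMultiplier) {
        return memstoreBytes > flushSizeBytes * (long) blockMultiplier;
    }

    // Server-level: global memstore > heap * hbase.regionserver.global.memstore.size
    static boolean serverBlocksWrites(long globalMemstoreBytes, long heapBytes,
                                      double globalFraction) {
        return globalMemstoreBytes > (long) (heapBytes * globalFraction);
    }

    public static void main(String[] args) {
        long flushSize = 128L << 20; // 128 MB flush size
        System.out.println(regionBlocksWrites(600L << 20, flushSize, 4));
        // prints "true": 600 MB > 128 MB * 4 = 512 MB

        long heap = 16L << 30;       // 16 GB heap
        System.out.println(serverBlocksWrites(7L << 30, heap, 0.4));
        // prints "true": 7 GB > 16 GB * 0.4 = 6.4 GB
    }
}
```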
23. Machine failure
• Failure
• Detect
• Reallocate
• Replay WAL
• Replaying the WAL is NOT required for puts
• hbase.master.distributed.log.replay
• (default true in 1.0)
• Failure = Detect + Reallocate + Retry
• That’s in the range of ~1s for simple failures
• Silent failures put you in the 10s range if the hardware does not help
• zookeeper.session.timeout
24. Single puts
• Millisecond range
• Spikes do happen in steady mode
• 100ms
• Causes: GC, load, splits
25. Streaming puts
HTable#setAutoFlushTo(false)
HTable#put
HTable#flushCommits
• Same as single puts, but
• Puts are grouped and sent in the background
• Load is taken into account
• Does not block
26. Multiple puts
hbase.client.max.total.tasks (default 100)
hbase.client.max.perserver.tasks (default 5)
hbase.client.max.perregion.tasks (default 1)
• Decouple the client from a latency spike of a region server
• Increase the throughput by 50% compared to old multiput
• Makes split and GC more transparent
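The three limits above act as client-side admission control. A toy sketch of that idea (not the real client implementation): a new batch task is admitted only while the total, per-server, and per-region in-flight counts stay under their limits, so one slow region cannot absorb the whole client.

```java
import java.util.HashMap;
import java.util.Map;

// Toy client-side admission control mirroring the three task limits.
public class ClientTaskLimits {
    final int maxTotal, maxPerServer, maxPerRegion;
    int total = 0;
    final Map<String, Integer> perServer = new HashMap<>();
    final Map<String, Integer> perRegion = new HashMap<>();

    ClientTaskLimits(int maxTotal, int maxPerServer, int maxPerRegion) {
        this.maxTotal = maxTotal;
        this.maxPerServer = maxPerServer;
        this.maxPerRegion = maxPerRegion;
    }

    // A new in-flight task is admitted only if all three limits allow it.
    boolean trySubmit(String server, String region) {
        if (total >= maxTotal
                || perServer.getOrDefault(server, 0) >= maxPerServer
                || perRegion.getOrDefault(region, 0) >= maxPerRegion) {
            return false; // caller waits instead of piling onto a slow spot
        }
        total++;
        perServer.merge(server, 1, Integer::sum);
        perRegion.merge(region, 1, Integer::sum);
        return true;
    }

    public static void main(String[] args) {
        ClientTaskLimits limits = new ClientTaskLimits(100, 5, 1); // the defaults
        System.out.println(limits.trySubmit("rs1", "regionA")); // prints "true"
        System.out.println(limits.trySubmit("rs1", "regionA")); // prints "false": 1 per region
        System.out.println(limits.trySubmit("rs1", "regionB")); // prints "true": other region
    }
}
```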
27. Conclusion on write path
• Single puts can be very fast
• It’s not a « hard real time » system: there are spikes
• Most latency spikes can be hidden when streaming puts
• Failures are NOT that difficult for the write path
• No WAL to replay
29. Read path
• Get/short scan are assumed to be the low-latency operations
• Again, two APIs
• Single get: HTable#get(Get)
• Multi-get: HTable#get(List<Get>)
• Four stages, same as write path
• Start (tcp connection, …)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system: you may need to add machines or tune your workload
30. Multi get / Client
• Group Gets by RegionServer
• Execute them one by one
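The grouping step can be sketched like this (names hypothetical; the real client resolves each row's region location, which is faked here with a hash over a fixed server count). Each bucket then becomes one multi-get RPC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch: bucket row keys by owning RegionServer, one RPC per bucket.
public class GroupGets {
    static Map<String, List<String>> groupByServer(List<String> rowKeys, int servers) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String key : rowKeys) {
            // Faked location lookup: a hash instead of real region metadata.
            String server = "rs" + (Math.abs(key.hashCode()) % servers);
            groups.computeIfAbsent(server, s -> new ArrayList<>()).add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups =
                groupByServer(List.of("row1", "row2", "row3", "row4"), 2);
        System.out.println(groups); // one multi-get per server bucket
    }
}
```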
33. Storage hierarchy: access latency magnified, a different view
Dean/2009
Memory is 100,000x faster than disk!
Disk seek = 10 ms
34. Known unknowns
• For each candidate HFile
• Exclude by file metadata
• Timestamp
• Rowkey range
• Exclude by bloom filter
StoreFileScanner#shouldUseScanner()
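A toy version of that elimination check (the real logic lives in StoreFileScanner#shouldUseScanner; here a Set stands in for the bloom filter, since both never give false negatives): a file is skipped when its timestamp range, its row-key range, or its bloom filter proves it cannot hold the requested cell.

```java
import java.util.Set;

// Toy candidate-file elimination: skip an HFile when metadata or the bloom
// filter proves it cannot contain the requested row/timestamp.
public class HFileCandidate {
    final String firstKey, lastKey;
    final long minTs, maxTs;
    final Set<String> rows; // stand-in for a bloom filter (no false negatives)

    HFileCandidate(String firstKey, String lastKey, long minTs, long maxTs,
                   Set<String> rows) {
        this.firstKey = firstKey;
        this.lastKey = lastKey;
        this.minTs = minTs;
        this.maxTs = maxTs;
        this.rows = rows;
    }

    boolean shouldUseScanner(String rowKey, long ts) {
        if (ts < minTs || ts > maxTs) return false; // timestamp metadata excludes
        if (rowKey.compareTo(firstKey) < 0
                || rowKey.compareTo(lastKey) > 0) return false; // rowkey range excludes
        return rows.contains(rowKey); // bloom filter may rule it out
    }

    public static void main(String[] args) {
        HFileCandidate f = new HFileCandidate("a", "m", 100, 200, Set.of("cat", "dog"));
        System.out.println(f.shouldUseScanner("cat", 150));   // prints "true": must read
        System.out.println(f.shouldUseScanner("cat", 50));    // prints "false": too old
        System.out.println(f.shouldUseScanner("zebra", 150)); // prints "false": out of range
    }
}
```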
35. Unknown knowns
• Merge sort results polled from Stores
• Seek each scanner to a reference KeyValue
• Retrieve candidate data from disk
• Multiple HFiles => multiple seeks
• hbase.storescanner.parallel.seek.enable=true
• Short Circuit Reads
• dfs.client.read.shortcircuit=true
• Block locality
• Happy clusters compact!
HFileBlock#readBlockData()
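The merge sort over multiple Stores can be sketched with a priority queue of cursors, one per sorted source (illustrative; HBase's real heap over scanners is more involved). The queue always yields the globally smallest remaining KeyValue.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Toy k-way merge: each memstore/HFile contributes a sorted iterator, and a
// priority queue of cursors yields one globally sorted stream.
public class MergeScan {
    private static final class Cursor {
        final Iterator<String> it;
        String head;
        Cursor(Iterator<String> it) { this.it = it; head = it.next(); }
    }

    static List<String> merge(List<List<String>> sortedSources) {
        PriorityQueue<Cursor> pq =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.head));
        for (List<String> src : sortedSources) {
            if (!src.isEmpty()) pq.add(new Cursor(src.iterator()));
        }
        List<String> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            Cursor c = pq.poll();        // smallest current key across all files
            out.add(c.head);
            if (c.it.hasNext()) { c.head = c.it.next(); pq.add(c); }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> files = List.of(
                List.of("a", "d", "g"), List.of("b", "e"), List.of("c", "f"));
        System.out.println(merge(files)); // prints "[a, b, c, d, e, f, g]"
    }
}
```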
37. BlockCache Showdown
• LruBlockCache
• Default, onheap
• Quite good most of the time
• Evictions impact GC
• BucketCache
• Offheap alternative
• Serialization overhead
• Large memory configurations
http://www.n10k.com/blog/blockcache-showdown/
L2 off-heap BucketCache
makes a strong showing
38. Latency enemies: Garbage Collection
• Use heap. Not too much. With CMS.
• Max heap
• 30GB (compressed pointers)
• 8-16GB if you care about 9’s
• Healthy cluster load
• regular, reliable collections
• 25-100ms pause on regular interval
• Overloaded RegionServer suffers GC overmuch
39. Off-heap to the rescue?
• BucketCache (0.96, HBASE-7404)
• Network interfaces (HBASE-9535)
• MemStore et al (HBASE-10191)
40. Latency enemies: Compactions
• Fewer HFiles => fewer seeks
• Evict data blocks!
• Evict Index blocks!!
• hfile.block.index.cacheonwrite
• Evict bloom blocks!!!
• hfile.block.bloom.cacheonwrite
• OS buffer cache to the rescue
• Compacted data is still fresh
• Better than going all the way back to disk
42. Hedging our bets
• HDFS Hedged reads (2.4, HDFS-5776)
• Reads on secondary DataNodes
• Strongly consistent
• Works at the HDFS level
• Timeline consistency (HBASE-10070)
• Reads on « Replica Region »
• Not strongly consistent
43. Read latency in summary
• Steady mode
• Cache hit: < 1 ms
• Cache miss: + 10 ms per seek
• Writing while reading => cache churn
• GC: 25-100ms pause on regular interval
Network request + (1 - P(cache hit)) * (10 ms * seeks)
• Same long tail issues as write
• Overloaded: same scheduling issues as write
• Partial failures hurt a lot
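Plugging illustrative numbers into the formula above (assumed values, not measurements): with a high cache-hit rate the expected latency stays in the low milliseconds, and every miss adds roughly 10 ms per candidate-file seek.

```java
// Numeric reading of: Network request + (1 - P(cache hit)) * (10 ms * seeks)
public class ReadLatencyModel {
    static double expectedMs(double networkMs, double pCacheHit,
                             double seekMs, int seeks) {
        return networkMs + (1 - pCacheHit) * seekMs * seeks;
    }

    public static void main(String[] args) {
        // 0.5 ms network, 95% cache hit rate, 10 ms per seek, 3 candidate HFiles.
        System.out.printf("%.2f ms%n", expectedMs(0.5, 0.95, 10, 3));
        // prints "2.00 ms"
    }
}
```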
44. HBase ranges for 99% latency

        | Put                  | Streamed Multiput | Get                  | Timeline get
Steady  | milliseconds         | milliseconds      | milliseconds         | milliseconds
Failure | seconds              | seconds           | seconds              | milliseconds
GC      | 10's of milliseconds | milliseconds      | 10's of milliseconds | milliseconds
45. What’s next
• Less GC
• Use fewer objects
• Offheap
✓Compressed BlockCache (HBASE-11331)
• Preferred location (HBASE-4755)
• The « magical 1% »
• Most tools stop at the 99% latency
• What happens after is much more complex
46. Performance with Compressed BlockCache
[Chart: times improvement (0x–35x), enabled vs. disabled, for throughput (ops/sec), latency (ms, p95), latency (ms, p99), and CPU load]
Total RAM: 24G, LruBlockCache Size: 12G
Data Size: 45G, Compressed Size: 11G
Compression: SNAPPY
47. Thanks!
Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
Strata New York, October 17, 2014
Editor's Notes
This talk: assume get/short scan implies low-latency requirements
No fancy streaming client like Puts; waiting for slowest RS.
Gets grouped by RS, groups sent in parallel, block until all groups return.
Network call: 10’s micros, in parallel
Read path in full detail is quite complex
See Lars' HBaseCon 2012 talk for the nitty-gritty
Complexity optimizing around one invariant: 10ms seek
Aiming for 100 microSec world; how to work around this?
Goal: avoid disk at all cost!
Goal: don’t go to disk unless absolutely necessary.
Tactic: Candidate HFile elimination.
Regular compactions => 3-5 candidates
Mergesort over multiple files, multiple seeks
More spindles = parallel scanning
SCR avoids proxy process (DataNode)
But remember! Goal: don’t go to disk unless absolutely necessary.
“block” is a segment of an HFile
Data blocks, index blocks, and bloom blocks
Read blocks retained in BlockCache
Seek to same and adjacent data become cache hits
HBase ships with a variety of cache implementations
Happy with 95% stats? Stick with LruBlockCache and modest heapsize
Pushing 99%? Lru still okay, but watch that heapsize.
Spent money on RAM? BucketCache
GC is a part of healthy operation
BlockCache garbage is awkward size and age, which means pauses
Pause time is a function of heap size
More like ~16GiB if you’re really worried about 99%
Overloaded: more cache evictions, time in GC
Why generate garbage at all?
GC are smart, but maybe we know our pain spots better?
Don’t know until we try
Necessary for fewer scan candidates, fewer seeks
Buffer cache to the rescue
“That’s steady state and overloaded; let’s talk about failure”
Replaying WALs takes time
Unlucky: no data locality, talking to a remote DataNode
Empty cache
“Failure isn’t binary. What about the sick and the dying?”
Don’t wait for a slow machine!
Reads dominated by disk seek, so keep that data in memory
After cache miss, GC is the next candidate cause of latency
“Ideal formula”
P(cache hit): fn (cache size :: db size, request locality)
Sometimes the jitter dominates
Standard deployment, well designed schema
Millisecond responses, seconds for failure recovery, and GC at a regular interval
Everything we’ve focused on here is impactful of the 99%
Beyond that there’s a lot of interesting problems to solve
There’s always more work to be done
Generate less garbage
Compressed BlockCache
Improve recovery time and locality