HBase Low Latency
  • Random read/write database, latency is very important
  • Microseconds? Seconds? We are talking milliseconds
    Everyone stops looking after the 99% -- the literature calls this “magical 1%”
  • How to measure latency in HBase
  • Client connects to region server with TCP connection
    Connection is shared by client threads
    Server manages lots of client connections
    Schedules client queries – synchronization, locks, queues, &c
  • WAL queue shared between regions
    - sometimes the sync work has already been done for you, can help
    - sometimes your small edit is sync’d with another larger one, can hurt you
  • Small cluster, 4-yo machines
    1 put, 1 put, 1 put…
    99% is double the mean
    Servers doing other work, but nothing major
  • Where do we spend time? How about the network?
    0.5ms is conservative, usually much less
    TCP round trip, same cluster as previous slide
    99% 4x mean
  • Where do we spend time? How about HDFS?
    Flushing, writing, &c
    Just HDFS, 2.4, 1kb, flush, flush, flush
    99% 5x mean – not bad
  • What else?
    Millisecond world means minor things start to matter:
    - JVM 1.7 is better at blocking queues than 1.6
    - forgot to configure Nagle's algorithm? 40ms
    - older linux scheduler bugs, 50ms
    - facebook literature talks a lot about filesystems
  • Regular cluster operations also hurt
    Region split – 1’s of seconds
    Region move – 100ms (better on 0.96)
    GC – 10’s ms even after configuration tuning
  • Now add some load
    Load == concurrency == queues: 1ms
    0.98 adds pluggable scheduler, so you can influence these queues yourself
    Work in progress to expose this through configuration
  • Contract: save you from cluster explosion
    How to protect? Stop writing
    - too many WALs? Stop writes.
    - too many hfiles to compact? Stop writes.
    Lasts indefinitely. New default settings.
  • Puts do not require reads, so WAL replay doesn’t block writes
    Simple crash: quickly detected, 1s
    - conservative detection takes longer
    - configured to 5-15s, but can look like a long GC pause
  • New since 0.96
  • Settings for average cluster, unaggressive client
    Decouples the client from issues at a single RS
    YCSB, single empty region, 50% better throughput (HBASE-6295)
  • This talk: assume get/short scan implies low-latency requirements
    No fancy streaming client like Puts; waiting for slowest RS.
  • Gets grouped by RS, groups sent in parallel, block until all groups return.
    Network call: 10’s micros, in parallel
  • Read path in full detail is quite complex
    See Lars’ HBaseCon 2012 talk for the nitty-gritty
    Complexity optimizing around one invariant: 10ms seek
  • Complexity optimizing around one invariant: 10ms seek
    Aiming for a 100-microsecond world; how do we work around this?
    Goal: avoid disk at all cost!
  • Goal: don’t go to disk unless absolutely necessary.
    Tactic: Candidate HFile elimination.
    Regular compactions => 3-5 candidates
    Bloom filters on by default in 0.96+
  • Mergesort over multiple files, multiple seeks
    More spindles = parallel scanning
    SCR avoids proxy process (DataNode)
    But remember! Goal: don’t go to disk unless absolutely necessary.
  • “block” is a segment of an HFile
    Data blocks, index blocks, and bloom blocks
    Read blocks retained in BlockCache
    Seek to same and adjacent data become cache hits
  • HBase ships with a variety of cache implementations
    Happy with 95% stats? Stick with LruBlockCache and modest heapsize
    Pushing 99%? Lru still okay, but watch that heapsize.
    Spent money on RAM? BucketCache
  • GC is a part of healthy operation
    BlockCache garbage is awkward size and age, which means pauses
    Pause time is a function of heap size
    More like ~16GiB if you’re really worried about 99%
    Overloaded: more cache evictions, time in GC
  • Why generate garbage at all?
    GC are smart, but maybe we know our pain spots better?
    Don’t know until we try
  • Necessary for fewer scan candidates, fewer seeks
    Buffer cache to the rescue
    “That’s steady state and overloaded; let’s talk about failure”
  • Replaying WALs takes time
    Unlucky: no data locality, talking to a remote DataNode
    Empty cache
    “Failure isn’t binary. What about the sick and the dying?”
  • Don’t wait for a slow machine!
  • Reads dominated by disk seek, so keep that data in memory
    After cache miss, GC is the next candidate cause of latency
    “Ideal formula”
    P(cache hit): fn (cache size :: db size, request locality)
    Sometimes the jitter dominates
  • Standard deployment, well designed schema
    Millisecond responses, seconds for failure recovery, and GC at a regular interval
    Everything we’ve focused on here impacts the 99%
    Beyond that there’s a lot of interesting problems to solve
  • There’s always more work to be done
    Generate less garbage
    Compressed BlockCache
    Improve recovery time and locality
  • Questions!


  • 1. HBase Low Latency Nick Dimiduk, Hortonworks (@xefyr) Nicolas Liochon, Scaled Risk (@nkeywal) Hadoop Summit June 4, 2014
  • 2. Agenda • Latency, what is it, how to measure it • Write path • Read path • Next steps
  • 3. What’s low latency • Latency is about percentiles • Average != 50th percentile • There are often orders of magnitude between “average” and “95th percentile” • Post 99% = “magical 1%”. Work in progress here. • Meaning anywhere from microseconds (high-frequency trading) to seconds (interactive queries) • In this talk: milliseconds
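The percentile point above can be made concrete with a small, self-contained Java sketch (this is an illustration, not HBase code): with a long-tailed latency sample, the mean lands far from the median, and the 99th percentile far above both, which is why averages hide the tail.

```java
import java.util.Arrays;

// Toy illustration of why latency is about percentiles, not averages.
public class Percentiles {
    // Nearest-rank percentile over a sorted copy of the samples.
    public static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static double mean(double[] samples) {
        double sum = 0;
        for (double s : samples) sum += s;
        return sum / samples.length;
    }

    public static void main(String[] args) {
        // 95 fast requests at 1 ms, 5 slow ones at 100 ms: a long tail.
        double[] latencies = new double[100];
        Arrays.fill(latencies, 0, 95, 1.0);
        Arrays.fill(latencies, 95, 100, 100.0);
        System.out.printf("mean=%.2f p50=%.2f p99=%.2f%n",
                mean(latencies), percentile(latencies, 0.50),
                percentile(latencies, 0.99));
    }
}
```

With this sample the mean is 5.95 ms while the median is 1 ms and the 99th percentile 100 ms: the mean sits nowhere near the 50th percentile.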
  • 4. Measure latency • bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation • More options related to HBase: autoflush, replicas, … • Latency measured in microseconds • Easier for internal analysis • YCSB - Yahoo! Cloud Serving Benchmark • Useful for comparison between databases • Set of workloads already defined
  • 5. Write path • Two parts • Single put (WAL) • The client just sends the put • Multiple puts from the client (new behavior since 0.96) • The client is much smarter • Four stages to look at for latency • Start (establish tcp connections, etc.) • Steady: when expected conditions are met • Machine failure: expected as well • Overloaded system
  • 6. Single put: communication & scheduling • Client: TCP connection to the server • Shared: multiple threads on the same client use the same TCP connection • Pooling is possible and does improve performance in some circumstances • hbase.client.ipc.pool.size • Server: multiple calls from multiple threads on multiple machines • Can become thousands of simultaneous queries • Scheduling is required
  • 7. Single put: real work • The server must • Write into the WAL queue • Sync the WAL queue (HDFS flush) • Write into the memstore • WAL queue is shared between all the regions/handlers • Sync is avoided if another handler already did the work • Your handler may flush more data than expected
  • 8. Simple put: A small run
    Percentile   Time in ms
    Mean         1.21
    50%          0.95
    95%          1.50
    99%          2.12
  • 9. Latency sources • Candidate one: network • 0.5ms within a datacenter • Much less between nodes in the same rack
    Percentile   Time in ms
    Mean         0.13
    50%          0.12
    95%          0.15
    99%          0.47
  • 10. Latency sources • Candidate two: HDFS Flush • We can still do better: HADOOP-7714 & sons.
    Percentile   Time in ms
    Mean         0.33
    50%          0.26
    95%          0.59
    99%          1.24
  • 11. Latency sources • Millisecond world: everything can go wrong • JVM • Network • OS Scheduler • File System • All this goes into the post 99% percentile • Requires monitoring • Usually using the latest version helps
  • 12. Latency sources • Split (and presplits) • Autosharding is great! • Puts have to wait • Impacts: seconds • Balance • Regions move • Triggers a retry for the client • hbase.client.pause = 100ms since HBase 0.96 • Garbage Collection • Impacts: 10’s of ms, even with a good config • Covered with the read path of this talk
  • 13. From steady to loaded and overloaded • Number of concurrent tasks is a function of • Number of cores • Number of disks • Number of remote machines used • Difficult to estimate • Queues are doomed to happen • hbase.regionserver.handler.count • So for low latency • Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code. • RPC priorities: work in progress (HBASE-11048)
  • 14. From loaded to overloaded • MemStore takes too much room: flush, then blocks quite quickly • hbase.regionserver.global.memstore.size.lower.limit • hbase.regionserver.global.memstore.size • hbase.hregion.memstore.block.multiplier • Too many HFiles: block until compactions keep up • hbase.hstore.blockingStoreFiles • Too many WAL files: flush and block • hbase.regionserver.maxlogs
  • 15. Machine failure • Failure • Detect • Reallocate • Replay WAL • Replaying the WAL is NOT required for puts • hbase.master.distributed.log.replay • (default true in 1.0) • Failure = Detect + Reallocate + Retry • That’s in the range of ~1s for simple failures • Silent failures put you in the 10s range if the hardware does not help • zookeeper.session.timeout
  • 16. Single puts • Millisecond range • Spikes do happen in steady mode • 100ms • Causes: GC, load, splits
  • 17. Streaming puts • Htable#setAutoFlushTo(false) • Htable#put • Htable#flushCommit • Same as simple puts, but • Puts are grouped and sent in the background • Load is taken into account • Does not block
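The buffering idea behind streaming puts can be sketched in a few lines of plain Java. This is a toy model of client-side write buffering, not the HBase client: the class and method names (`BufferedPuts`, `flushCount`) are invented for illustration. The point is that `put` never blocks on the network; one round trip is amortized over a whole batch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of client-side write buffering (the idea behind
// HTable#setAutoFlushTo(false)); NOT the HBase implementation.
public class BufferedPuts {
    private final int flushThreshold;
    private final List<String> buffer = new ArrayList<>();
    private int flushCount = 0;

    public BufferedPuts(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // put() only appends locally, so the caller never blocks on the network.
    public void put(String row) {
        buffer.add(row);
        if (buffer.size() >= flushThreshold) {
            flush();
        }
    }

    // flush() is where the one (simulated) RPC for the whole batch would go.
    public void flush() {
        if (buffer.isEmpty()) return;
        flushCount++; // one round trip amortized over flushThreshold puts
        buffer.clear();
    }

    public int flushCount() {
        return flushCount;
    }
}
```

With a threshold of 10, twenty-five puts cost two background flushes plus one final explicit flush, instead of twenty-five synchronous round trips.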
  • 18. Multiple puts hbase.client.max.total.tasks (default 100) hbase.client.max.perserver.tasks (default 5) hbase.client.max.perregion.tasks (default 1) • Decouple the client from a latency spike of a region server • Increase the throughput by 50% compared to old multiput • Makes split and GC more transparent
  • 19. Conclusion on write path • Single puts can be very fast • It’s not a “hard real time” system: there are spikes • Most latency spikes can be hidden when streaming puts • Failures are NOT that difficult for the write path • No WAL to replay
  • 20. And now for the read path
  • 21. Read path • Get/short scan are assumed for low-latency operations • Again, two APIs • Single get: HTable#get(Get) • Multi-get: HTable#get(List<Get>) • Four stages, same as write path • Start (tcp connection, …) • Steady: when expected conditions are met • Machine failure: expected as well • Overloaded system: you may need to add machines or tune your workload
  • 22. Multi get / Client Group Gets by RegionServer Execute them one by one
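The notes describe the multi-get pattern as: group the gets by RegionServer, send one batch per server in parallel, and block until every batch returns. A toy sketch of that shape, with `serverFor` as an invented stand-in for the real region location lookup and a fake per-row read in place of the RPC:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy sketch of the multi-get pattern; not HBase client code.
public class MultiGet {
    // Stand-in for region location lookup: hash each row key to a "server".
    static int serverFor(String row, int nServers) {
        return Math.floorMod(row.hashCode(), nServers);
    }

    public static Map<String, String> get(List<String> rows, int nServers) {
        // 1. Group the gets by server.
        Map<Integer, List<String>> groups = new HashMap<>();
        for (String row : rows) {
            groups.computeIfAbsent(serverFor(row, nServers), k -> new ArrayList<>()).add(row);
        }
        // 2. One task per server group, executed in parallel.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, groups.size()));
        List<Callable<Map<String, String>>> tasks = new ArrayList<>();
        for (List<String> group : groups.values()) {
            tasks.add(() -> {
                Map<String, String> result = new HashMap<>();
                for (String row : group) {
                    result.put(row, "value-" + row); // fake per-row read
                }
                return result;
            });
        }
        // 3. invokeAll blocks until every group returns: latency = slowest server.
        Map<String, String> merged = new HashMap<>();
        try {
            for (Future<Map<String, String>> f : pool.invokeAll(tasks)) {
                merged.putAll(f.get());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return merged;
    }
}
```

Step 3 is where the latency cost hides: the caller waits for the slowest server, exactly as the notes warn.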
  • 23. Multi get / Server
  • 24. Multi get / Server
  • 25. Access latency magnitudes • Storage hierarchy: a different view (Dean, 2009) • Memory is 100,000x faster than disk! • Disk seek = 10ms
  • 26. Known unknowns • For each candidate HFile • Exclude by file metadata • Timestamp • Rowkey range • Exclude by bloom filter • StoreFileScanner#shouldUseScanner()
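The metadata-based elimination above can be modeled in a few lines. This is an illustrative sketch only: `FileMeta` and its fields are hypothetical stand-ins for the per-HFile metadata, not HBase's actual StoreFile structures. A file whose rowkey range or timestamp range cannot contain the requested cell is dropped without any disk seek.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of candidate-HFile elimination by file metadata.
public class CandidateFiles {
    public static class FileMeta {
        public final String firstKey, lastKey; // rowkey range held by the file
        public final long minTs, maxTs;        // timestamp range held by the file
        public FileMeta(String firstKey, String lastKey, long minTs, long maxTs) {
            this.firstKey = firstKey; this.lastKey = lastKey;
            this.minTs = minTs; this.maxTs = maxTs;
        }
    }

    // Keep only files that might contain 'row' within [tsMin, tsMax];
    // everything else is excluded without a single seek.
    public static List<FileMeta> candidates(List<FileMeta> files, String row,
                                            long tsMin, long tsMax) {
        List<FileMeta> keep = new ArrayList<>();
        for (FileMeta f : files) {
            boolean keyInRange = f.firstKey.compareTo(row) <= 0
                    && f.lastKey.compareTo(row) >= 0;
            boolean tsOverlap = f.minTs <= tsMax && f.maxTs >= tsMin;
            if (keyInRange && tsOverlap) {
                keep.add(f);
            }
        }
        return keep;
    }
}
```

A bloom filter check would further prune the survivors; with regular compactions keeping the file count at 3-5, the list of files actually seeked stays small.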
  • 27. Unknown knowns • Merge sort results polled from Stores • Seek each scanner to a reference KeyValue • Retrieve candidate data from disk • Multiple HFiles => multiple seeks • hbase.storescanner.parallel.seek.enable=true • Short Circuit Reads • dfs.client.read.shortcircuit=true • Block locality • Happy clusters compact! • HFileBlock#readBlockData()
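The merge sort over multiple files mentioned in the notes has a familiar shape: keep one cursor per sorted file in a heap and repeatedly poll the smallest key. A self-contained illustration over plain string keys (not HBase's StoreScanner, which merges KeyValues with versioning and delete handling on top of this):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy merge-sort over several sorted "HFile" key lists.
public class MergeScan {
    public static List<String> merge(List<List<String>> sortedFiles) {
        // Heap entry {fileIndex, offsetInFile}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> sortedFiles.get(e[0]).get(e[1])));
        for (int i = 0; i < sortedFiles.size(); i++) {
            if (!sortedFiles.get(i).isEmpty()) {
                heap.add(new int[]{i, 0});
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<String> file = sortedFiles.get(top[0]);
            out.add(file.get(top[1]));
            // Advance that file's cursor and re-insert it: the "next" seek.
            if (top[1] + 1 < file.size()) {
                heap.add(new int[]{top[0], top[1] + 1});
            }
        }
        return out;
    }
}
```

Each list in the input stands for one HFile: more files means more cursors to advance, which on a cache miss means more seeks, hence the parallel-seek option and the goal of keeping the file count low.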
  • 28. BlockCache • Reuse previously read data • Maximize cache hit rate • Larger cache • Temporal access locality • Physical access locality BlockCache#getBlock()
  • 29. BlockCache Showdown • LruBlockCache • Default, onheap • Quite good most of the time • Evictions impact GC • BucketCache • Offheap alternative • Serialization overhead • Large memory configurations • http://www.n10k.com/blog/blockcache-showdown/ • L2 off-heap BucketCache makes a strong showing
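The LRU semantics of the default cache can be demonstrated with `java.util.LinkedHashMap` in access order. This is a minimal sketch of the eviction policy only, nothing like LruBlockCache's actual priority tiers or heap accounting; it also makes the GC point tangible, since every eviction turns a cached block into garbage.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU cache in the spirit of LruBlockCache: the least-recently-used
// entry is evicted once capacity is exceeded. NOT the HBase implementation.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder=true: get() refreshes recency
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evicting here is what creates the awkwardly-sized, awkwardly-aged
        // garbage the GC section below worries about.
        return size() > capacity;
    }
}
```

A `get` refreshes an entry's recency, so a hot block survives while a cold one is evicted first.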
  • 30. Latency enemies: Garbage Collection • Use heap. Not too much. With CMS. • Max heap • 30GB (compressed pointers) • 8-16GB if you care about 9’s • Healthy cluster load • regular, reliable collections • 25-100ms pause on regular interval • Overloaded RegionServer suffers GC overmuch
  • 31. Off-heap to the rescue? • BucketCache (0.96, HBASE-7404) • Network interfaces (HBASE-9535) • MemStore et al (HBASE-10191)
  • 32. Latency enemies: Compactions • Fewer HFiles => fewer seeks • Evict data blocks! • Evict index blocks!! • hfile.block.index.cacheonwrite • Evict bloom blocks!!! • hfile.block.bloom.cacheonwrite • OS buffer cache to the rescue • Compacted data is still fresh • Better than going all the way back to disk
  • 33. Failure • Detect + Reassign + Replay • Strong consistency requires replay • Locality drops to 0 • Cache starts from scratch
  • 34. Hedging our bets • HDFS Hedged reads (2.4, HDFS-5776) • Reads on secondary DataNodes • Strongly consistent • Works at the HDFS level • Timeline consistency (HBASE-10070) • Reads on « Replica Region » • Not strongly consistent
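The hedging idea above reduces to: issue the same read against two replicas and take whichever answers first. A toy race with `CompletableFuture` that models the pattern behind HDFS-5776 and HBASE-10070, not either actual implementation:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Toy sketch of a hedged read: race two replicas, keep the first answer.
public class HedgedRead {
    public static String readHedged(Supplier<String> primary, Supplier<String> secondary) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<String> a = CompletableFuture.supplyAsync(primary, pool);
            CompletableFuture<String> b = CompletableFuture.supplyAsync(secondary, pool);
            // First replica to answer wins; the straggler's result is discarded.
            return (String) CompletableFuture.anyOf(a, b).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

This is what caps the damage from a sick-but-not-dead machine: a slow primary no longer dictates the latency, at the cost of extra load. Note the consistency caveat from the slide still applies to the HBase variant: a replica region read may return stale data.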
  • 35. Read latency in summary • Steady mode • Cache hit: < 1 ms • Cache miss: + 10 ms per seek • Writing while reading => cache churn • GC: 25-100ms pause on regular interval Network request + (1 - P(cache hit)) * (10 ms * seeks) • Same long tail issues as write • Overloaded: same scheduling issues as write • Partial failures hurt a lot
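The slide's back-of-envelope formula can be written out as arithmetic. The constants plugged in below are the rough numbers quoted elsewhere in the talk (0.5 ms network, 10 ms per seek, 3-5 candidate files), not measurements:

```java
// The slide's read-latency estimate as code:
// expected = network + (1 - P(cache hit)) * (seekMs * seeks)
public class ReadLatency {
    public static double expectedMs(double networkMs, double pCacheHit,
                                    double seekMs, int seeks) {
        return networkMs + (1.0 - pCacheHit) * seekMs * seeks;
    }

    public static void main(String[] args) {
        // 0.5 ms network, 95% hit rate, 10 ms per seek, 3 candidate files.
        System.out.println(expectedMs(0.5, 0.95, 10.0, 3)); // ≈ 2 ms
    }
}
```

The formula makes the leverage points obvious: push P(cache hit) up (bigger cache, request locality) or the seek count down (compactions, bloom filters), and the expected latency collapses toward the pure network cost.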
  • 36. HBase ranges for 99% latency
                 Put                    Streamed Multiput      Get                    Timeline get
    Steady       milliseconds           milliseconds           milliseconds           milliseconds
    Failure      seconds                seconds                seconds                milliseconds
    GC           10’s of milliseconds   milliseconds           10’s of milliseconds   milliseconds
  • 37. What’s next • Less GC • Use fewer objects • Offheap • Compressed BlockCache (HBASE-8894) • Preferred location (HBASE-4755) • The “magical 1%” • Most tools stop at the 99% latency • What happens after is much more complex
  • 38. Thanks! Nick Dimiduk, Hortonworks (@xefyr) Nicolas Liochon, Scaled Risk (@nkeywal) Hadoop Summit June 4, 2014