
HBase Low Latency


  1. HBase Low Latency Nick Dimiduk, Hortonworks (@xefyr) Nicolas Liochon, Scaled Risk (@nkeywal) Hadoop Summit, June 4, 2014
  2. Agenda • Latency, what is it, how to measure it • Write path • Read path • Next steps
  3. What’s low latency? • Latency is about percentiles • Average != 50th percentile • There are often orders of magnitude between « average » and « 95th percentile » • Post-99% = the « magical 1% ». Work in progress here. • Meaning anything from microseconds (High Frequency Trading) to seconds (interactive queries) • In this talk: milliseconds
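The point about averages hiding the tail can be made concrete with a toy computation (illustrative Java, not HBase code; `percentile` uses the nearest-rank method):

```java
import java.util.Arrays;

// Illustration: a few slow responses barely move the median
// but drag the mean and dominate the high percentiles.
public class LatencyPercentiles {
    // Nearest-rank percentile: p in (0, 100].
    static double percentile(double[] sample, double p) {
        double[] sorted = sample.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    static double mean(double[] sample) {
        double sum = 0;
        for (double v : sample) sum += v;
        return sum / sample.length;
    }

    public static void main(String[] args) {
        // 98 requests at 1 ms, 2 outliers at 50 ms.
        double[] ms = new double[100];
        Arrays.fill(ms, 1.0);
        ms[98] = 50.0;
        ms[99] = 50.0;
        System.out.println("mean = " + mean(ms));           // 1.98
        System.out.println("p50  = " + percentile(ms, 50)); // 1.0
        System.out.println("p99  = " + percentile(ms, 99)); // 50.0
    }
}
```

Two outliers out of a hundred double the mean while the median stays flat, which is exactly why the slide insists on percentiles.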
  4. Measure latency • bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation • More options related to HBase: autoflush, replicas, … • Latency measured in microseconds • Easier for internal analysis • YCSB - Yahoo! Cloud Serving Benchmark • Useful for comparison between databases • Set of workloads already defined
  5. Write path • Two parts • Single put (WAL) • The client just sends the put • Multiple puts from the client (new behavior since 0.96) • The client is much smarter • Four stages to look at for latency • Start (establish tcp connections, etc.) • Steady: when expected conditions are met • Machine failure: expected as well • Overloaded system
  6. Single put: communication & scheduling • Client: TCP connection to the server • Shared: multiple threads on the same client use the same TCP connection • Pooling is possible and does improve performance in some circumstances • hbase.client.ipc.pool.size • Server: multiple calls from multiple threads on multiple machines • Can become thousands of simultaneous queries • Scheduling is required
  7. Single put: real work • The server must • Write into the WAL queue • Sync the WAL queue (HDFS flush) • Write into the memstore • The WAL queue is shared between all the regions/handlers • Sync is avoided if another handler already did the work • Your handler may flush more data than expected
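The sync-piggybacking described above can be sketched as a toy group commit (an illustration of the idea only, not the actual WAL implementation):

```java
// Simplified group-commit sketch: handlers append edits with increasing
// sequence numbers; one sync covers everything appended so far, so a
// handler can skip its sync when another handler's sync already covered
// its edit. This is why a handler may also "flush more than expected".
public class GroupCommitSketch {
    private long nextSeq = 0;   // last appended sequence number
    private long syncedSeq = 0; // highest sequence number made durable
    int syncCalls = 0;          // how many real (HDFS) syncs were issued

    // Append an edit to the WAL queue; returns its sequence number.
    synchronized long append() { return ++nextSeq; }

    // Make edit `seq` durable, piggybacking on earlier syncs when possible.
    synchronized void syncUpTo(long seq) {
        if (seq <= syncedSeq) return; // another handler already synced past us
        syncedSeq = nextSeq;          // one sync covers all pending edits
        syncCalls++;
    }
}
```

Two handlers appending and then syncing results in a single real sync: the second call finds its edit already durable.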
  8. Single put: a small run • Mean: 1.21 ms • 50%: 0.95 ms • 95%: 1.50 ms • 99%: 2.12 ms
  9. Latency sources • Candidate one: network • 0.5ms within a datacenter • Much less between nodes in the same rack • Mean: 0.13 ms • 50%: 0.12 ms • 95%: 0.15 ms • 99%: 0.47 ms
  10. Latency sources • Candidate two: HDFS flush • We can still do better: HADOOP-7714 & sons. • Mean: 0.33 ms • 50%: 0.26 ms • 95%: 0.59 ms • 99%: 1.24 ms
  11. Latency sources • Millisecond world: everything can go wrong • JVM • Network • OS scheduler • File system • All this goes into the post-99% percentile • Requires monitoring • Usually using the latest version helps
  12. Latency sources • Split (and presplits) • Autosharding is great! • Puts have to wait • Impact: seconds • Balance • Regions move • Triggers a retry for the client • hbase.client.pause = 100ms since HBase 0.96 • Garbage collection • Impact: 10’s of ms, even with a good config • Covered in the read-path part of this talk
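The client pause mentioned above is scaled up on successive retries. A hedged sketch of that behavior follows; the multiplier table here is illustrative (the real table lives in the HBase client code):

```java
// Sketch of client retry backoff: the configured base pause
// (hbase.client.pause, 100ms by default since 0.96) is multiplied by a
// fixed table so that repeated failures back off progressively.
// The multipliers below are illustrative, not HBase's exact values.
public class RetryPause {
    static final int[] BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100};

    // Pause (ms) before retry number `tries` (0-based).
    static long pauseTime(long basePauseMs, int tries) {
        int idx = Math.min(tries, BACKOFF.length - 1);
        return basePauseMs * BACKOFF[idx];
    }
}
```

With a 100ms base, the first retry waits 100ms and a long-failing operation tops out at the last multiplier rather than growing forever.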
  13. From steady to loaded and overloaded • Number of concurrent tasks is a function of • Number of cores • Number of disks • Number of remote machines used • Difficult to estimate • Queues are doomed to happen • hbase.regionserver.handler.count • So for low latency • Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code. • RPC priorities: work in progress (HBASE-11048)
  14. From loaded to overloaded • MemStore takes too much room: flush, then blocks quite quickly • hbase.hregion.memstore.block.multiplier • Too many HFiles: block until compactions keep up • hbase.hstore.blockingStoreFiles • Too many WAL files: flush and block • hbase.regionserver.maxlogs
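The memstore back-pressure rule can be written down directly (a sketch of the idea; the real check lives in the RegionServer, and the parameter names below just mirror the configs on the slide):

```java
// Sketch of memstore back-pressure: writes to a region block once its
// memstore grows past flushSize * blockMultiplier, i.e. the flush could
// not keep up with the incoming write load. Illustrative only.
public class MemStorePressure {
    static boolean shouldBlockUpdates(long memstoreBytes,
                                      long flushSizeBytes,
                                      int blockMultiplier) {
        return memstoreBytes > flushSizeBytes * (long) blockMultiplier;
    }
}
```

For example, with a 128MB flush size and a multiplier of 4, writes block only once the region's memstore exceeds 512MB.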
  15. Machine failure • Failure • Detect • Reallocate • Replay WAL • Replaying WAL is NOT required for puts • hbase.master.distributed.log.replay • (default true in 1.0) • Failure = Detect + Reallocate + Retry • That’s in the range of ~1s for simple failures • Silent failures put you in the 10s range if the hardware does not help • zookeeper.session.timeout
  16. Single puts • Millisecond range • Spikes do happen in steady mode • 100ms • Causes: GC, load, splits
  17. Streaming puts HTable#setAutoFlushTo(false) HTable#put HTable#flushCommits() • Same as single puts, but • Puts are grouped and sent in the background • Load is taken into account • Does not block
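The effect of the client-side buffer can be modeled with a toy class (illustrative only; the real client also sends groups in the background and enforces per-server/per-region task limits):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the write buffer behind setAutoFlushTo(false): puts
// accumulate locally and go out as a group, either when the buffer fills
// or on an explicit flush, trading one RPC per put for one per group.
public class BufferedPuts {
    private final int bufferLimit;
    private final List<String> buffer = new ArrayList<>();
    int rpcsSent = 0; // grouped sends (one per server per flush in reality)

    BufferedPuts(int bufferLimit) { this.bufferLimit = bufferLimit; }

    void put(String row) {
        buffer.add(row);
        if (buffer.size() >= bufferLimit) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        rpcsSent++; // one grouped send instead of one RPC per put
        buffer.clear();
    }
}
```

Twenty-five puts through a buffer of ten cost three grouped sends instead of twenty-five round trips, which is where the throughput win and the spike-hiding come from.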
  18. Multiple puts • (default 100) • hbase.client.max.perserver.tasks (default 5) • hbase.client.max.perregion.tasks (default 1) • Decouple the client from a latency spike of a region server • Increase the throughput by 50% compared to old multiput • Makes split and GC more transparent
  19. Conclusion on write path • Single puts can be very fast • It’s not a « hard real time » system: there are spikes • Most latency spikes can be hidden when streaming puts • Failures are NOT that difficult for the write path • No WAL to replay
  20. And now for the read path
  21. Read path • Get/short scan are assumed for low-latency operations • Again, two APIs • Single get: HTable#get(Get) • Multi-get: HTable#get(List<Get>) • Four stages, same as write path • Start (tcp connection, …) • Steady: when expected conditions are met • Machine failure: expected as well • Overloaded system: you may need to add machines or tune your workload
  22. Multi get / Client • Group Gets by RegionServer • Execute them one by one
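The client-side grouping step can be sketched as follows; the `serverLocator` function here is a stand-in for the real region-location lookup:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of the multi-get client step: route each requested row to its
// RegionServer and group the gets so one RPC per server can be issued.
public class MultiGetGrouping {
    static Map<String, List<String>> groupByServer(
            List<String> rows, Function<String, String> serverLocator) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String row : rows) {
            groups.computeIfAbsent(serverLocator.apply(row),
                                   s -> new ArrayList<>()).add(row);
        }
        return groups;
    }
}
```

The overall call then blocks until every group has returned, so the slowest RegionServer in the set bounds the latency of the whole multi-get.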
  23. Multi get / Server
  24. Multi get / Server
  25. Access latency magnitudes • Storage hierarchy: a different view (Dean, 2009) • Memory is 100000x faster than disk! • Disk seek = 10ms
  26. Known unknowns • For each candidate HFile • Exclude by file metadata • Timestamp • Rowkey range • Exclude by bloom filter • StoreFileScanner#shouldUseScanner()
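The elimination logic behind StoreFileScanner#shouldUseScanner() can be approximated like this (a sketch; the field names are illustrative, and the bloom-filter check is left as a comment):

```java
// Sketch of candidate-HFile elimination: skip a file entirely when its
// metadata proves it cannot contain the requested row or timestamp,
// saving a disk seek per excluded file. Illustrative, not HBase code.
public class HFileFilter {
    static boolean mightContain(String row, long ts,
                                String fileFirstRow, String fileLastRow,
                                long fileMinTs, long fileMaxTs) {
        // Rowkey outside the file's key range: cannot contain the row.
        if (row.compareTo(fileFirstRow) < 0 || row.compareTo(fileLastRow) > 0) {
            return false;
        }
        // Timestamp outside the file's time range: cannot contain the cell.
        if (ts < fileMinTs || ts > fileMaxTs) return false;
        return true; // the bloom filter check would come next
    }
}
```

With regular compactions keeping the candidate count at 3-5 files, each exclusion here is one 10ms seek that never happens.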
  27. Unknown knowns • Merge sort results polled from Stores • Seek each scanner to a reference KeyValue • Retrieve candidate data from disk • Multiple HFiles => multiple seeks • Short Circuit Reads • Block locality • Happy clusters compact! • HFileBlock#readBlockData()
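The merge sort across per-HFile scanners is essentially a k-way merge; a minimal sketch follows, with sorted string lists standing in for file scanners:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the merge sort over Stores: a priority queue ordered by the
// current key yields one sorted stream from several sorted files. Each
// participating file costs at least one seek in the real read path.
public class StoreMergeSort {
    static List<String> merge(List<List<String>> sortedFiles) {
        // Heap entries: {fileIndex, offsetInFile}, ordered by current key.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> sortedFiles.get(e[0]).get(e[1])));
        for (int i = 0; i < sortedFiles.size(); i++) {
            if (!sortedFiles.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(sortedFiles.get(top[0]).get(top[1]));
            if (top[1] + 1 < sortedFiles.get(top[0]).size()) {
                heap.add(new int[]{top[0], top[1] + 1});
            }
        }
        return out;
    }
}
```

Fewer files means a smaller heap and, more importantly, fewer seeks, which is the "happy clusters compact!" point.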
  28. BlockCache • Reuse previously read data • Maximize cache hit rate • Larger cache • Temporal access locality • Physical access locality • BlockCache#getBlock()
  29. BlockCache Showdown • LruBlockCache • Default, on-heap • Quite good most of the time • Evictions impact GC • BucketCache • Off-heap alternative • Serialization overhead • Large memory configurations • L2 off-heap BucketCache makes a strong showing
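The LRU idea behind LruBlockCache can be shown in a few lines with LinkedHashMap's access-order mode (a toy bounded by entry count; the real cache is bounded by bytes and has priority segments for single-access, multi-access, and in-memory blocks):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU cache in the spirit of LruBlockCache: least-recently-accessed
// entries are evicted first once the cache is full.
public class TinyLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public TinyLruCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order, the LRU part
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict when over capacity
    }
}
```

Touching an entry moves it to the back of the eviction order, so a repeatedly-read block stays cached while cold blocks are evicted, which is exactly the temporal-locality bet the slide describes.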
  30. Latency enemies: Garbage Collection • Use heap. Not too much. With CMS. • Max heap • 30GB (compressed pointers) • 8-16GB if you care about 9’s • Healthy cluster load • regular, reliable collections • 25-100ms pause on regular interval • Overloaded RegionServer suffers GC overmuch
  31. Off-heap to the rescue? • BucketCache (0.96, HBASE-7404) • Network interfaces (HBASE-9535) • MemStore et al (HBASE-10191)
  32. Latency enemies: Compactions • Fewer HFiles => fewer seeks • Evict data blocks! • Evict index blocks!! • hfile.block.index.cacheonwrite • Evict bloom blocks!!! • hfile.block.bloom.cacheonwrite • OS buffer cache to the rescue • Compacted data is still fresh • Better than going all the way back to disk
  33. Failure • Detect + Reassign + Replay • Strong consistency requires replay • Locality drops to 0 • Cache starts from scratch
  34. Hedging our bets • HDFS hedged reads (2.4, HDFS-5776) • Reads on secondary DataNodes • Strongly consistent • Works at the HDFS level • Timeline consistency (HBASE-10070) • Reads on « Replica Regions » • Not strongly consistent
  35. Read latency in summary • Steady mode • Cache hit: < 1 ms • Cache miss: + 10 ms per seek • Writing while reading => cache churn • GC: 25-100ms pause on regular interval • Network request + (1 - P(cache hit)) * (10 ms * seeks) • Same long-tail issues as write • Overloaded: same scheduling issues as write • Partial failures hurt a lot
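The formula on the slide translates directly into code; the 10 ms per seek is the talk's rule of thumb, not a measurement:

```java
// The slide's back-of-envelope read latency model: the network round trip
// plus, on a cache miss, roughly 10 ms per disk seek.
public class ReadLatencyModel {
    static double expectedMs(double networkMs, double cacheHitRate,
                             int seeksOnMiss, double seekMs) {
        return networkMs + (1.0 - cacheHitRate) * seeksOnMiss * seekMs;
    }
}
```

At a 90% hit rate with three candidate files, the expected read sits around 3.5 ms; at 100% it collapses to the network cost alone, which is why cache hit rate dominates everything else in steady mode.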
  36. HBase ranges for 99% latency (Put / Streamed multiput / Get / Timeline get) • Steady: milliseconds / milliseconds / milliseconds / milliseconds • Failure: seconds / seconds / seconds / milliseconds • GC: 10’s of milliseconds / milliseconds / 10’s of milliseconds / milliseconds
  37. What’s next • Less GC • Use fewer objects • Off-heap • Compressed BlockCache (HBASE-8894) • Preferred location (HBASE-4755) • The « magical 1% » • Most tools stop at the 99% latency • What happens after is much more complex
  38. Thanks! Nick Dimiduk, Hortonworks (@xefyr) Nicolas Liochon, Scaled Risk (@nkeywal) Hadoop Summit, June 4, 2014

Editor's Notes

  • Random read/write database, latency is very important
  • Micro-seconds? Seconds? We are talking milliseconds
    Everyone stops looking after the 99% -- the literature calls this “magical 1%”
  • How to measure latency in HBase
  • Client connects to region server with TCP connection
    Connection is shared by client threads
    Server manages lots of client connections
    Schedules client queries – synchronization, locks, queues, &c
  • WAL queue shared between regions
    - sometimes the sync work has already been done for you, can help
    - sometimes your small edit is sync’d with another larger one, can hurt you
  • Small cluster, 4-yo machines
    1 put, 1 put, 1put…
    99% is double the mean
    Servers doing other work, but nothing major
  • Where do we spend time? How about the network?
    0.5ms is conservative, usually much less
    TCP round trip, same cluster as previous slide
    99% 4x mean
  • Where do we spend time? How about HDFS?
    Flushing, writing, &c
    Just HDFS, 2.4, 1kb, flush, flush, flush
    99% 5x mean – not bad
  • What else?
    Millisecond world means minor things start to matter:
    - JVM 1.7 is better at blocking queues than 1.6
    - forget to disable Nagle? 40ms
    - older linux scheduler bugs, 50ms
    - facebook literature talks a lot about filesystems
  • Regular cluster operations also hurt
    Region split – 1’s of seconds
    Region move – 100ms (better on 0.96)
    GC – 10’s ms even after configuration tuning
  • Now add some load
    Load == concurrency == queues: 1ms
    0.98 adds pluggable scheduler, so you can influence these queues yourself
    Work in progress to expose this through configuration
  • Contract: save you from cluster explosion
    How to protect? Stop writing
    - too many WALs? Stop writes.
    - too many hfiles to compact? Stop writes.
    Lasts indefinitely. New default settings.
  • Puts do not require reads, so WAL replay doesn’t block writes
    Simple crash: quickly detected, 1s
    - conservative detection takes longer
    - configured to 5-15s, because a silent failure looks like a long GC pause
  • New since 0.96
  • Settings for average cluster, unaggressive client
    Decoupled client from issue of a single RS
    YCSB, single empty region, 50% better throughput (HBASE-6295)
  • This talk: assume get/short scan implies low-latency requirements
    No fancy streaming client like Puts; waiting for slowest RS.
  • Gets grouped by RS, groups sent in parallel, block until all groups return.
    Network call: 10’s micros, in parallel
  • Read path in full detail is quite complex
    See Lars’ HBaseCon 2012 talk for the nitty-gritty
    Complexity optimizing around one invariant: 10ms seek
  • Read path in full detail is quite complex
    See Lars’ HBaseCon 2012 talk for the nitty-gritty
    Complexity optimizing around one invariant: 10ms seek
  • Complexity optimizing around one invariant: 10ms seek
    Aiming for 100 microSec world; how to work around this?
    Goal: avoid disk at all cost!
  • Goal: don’t go to disk unless absolutely necessary.
    Tactic: Candidate HFile elimination.
    Regular compactions => 3-5 candidates
    Bloom filters on by default in 0.96+
  • Mergesort over multiple files, multiple seeks
    More spindles = parallel scanning
    SCR avoids proxy process (DataNode)
    But remember! Goal: don’t go to disk unless absolutely necessary.
  • “block” is a segment of an HFile
    Data blocks, index blocks, and bloom blocks
    Read blocks retained in BlockCache
    Seek to same and adjacent data become cache hits
  • HBase ships with a variety of cache implementations
    Happy with 95% stats? Stick with LruBlockCache and modest heapsize
    Pushing 99%? Lru still okay, but watch that heapsize.
    Spent money on RAM? BucketCache
  • GC is a part of healthy operation
    BlockCache garbage is awkward size and age, which means pauses
    Pause time is a function of heap size
    More like ~16GiB if you’re really worried about 99%
    Overloaded: more cache evictions, time in GC
  • Why generate garbage at all?
    GC are smart, but maybe we know our pain spots better?
    Don’t know until we try
  • Necessary for fewer scan candidates, fewer seeks
    Buffer cache to the rescue
    “That’s steady state and overloaded; let’s talk about failure”
  • Replaying WALs takes time
    Unlucky: no data locality, talking to a remote DataNode
    Empty cache
    “Failure isn’t binary. What about the sick and the dying?”
  • Don’t wait for a slow machine!
  • Reads dominated by disk seek, so keep that data in memory
    After cache miss, GC is the next candidate cause of latency
    “Ideal formula”
    P(cache hit): fn (cache size :: db size, request locality)
    Sometimes the jitter dominates
  • Standard deployment, well designed schema
    Millisecond responses, seconds for failure recovery, and GC at a regular interval
    Everything we’ve focused on here is impactful of the 99%
    Beyond that there’s a lot of interesting problems to solve
  • There’s always more work to be done
    Generate less garbage
    Compressed BlockCache
    Improve recovery time and locality
  • Questions!