HBase Low Latency, StrataNYC 2014


We start by looking at distributed database features that impact latency. Then we take a deeper look at the HBase read and write paths with a focus on request latency. We examine the sources of latency and how to minimize them.

Published in: Technology

  1. 1. HBase Low Latency Nick Dimiduk, Hortonworks (@xefyr) Nicolas Liochon, Scaled Risk (@nkeywal) Strata New York, October 17, 2014
  2. 2. Agenda • Latency, what is it, how to measure it • Write path • Read path • Next steps
  3. 3. What’s low latency • Meaning ranges from microseconds (High Frequency Trading) to seconds (interactive queries) • In this talk: milliseconds • Latency is about percentiles • Average != 50th percentile • There are often orders of magnitude between the « average » and the « 95th percentile » • Beyond the 99th percentile = the « magical 1% ». Work in progress here.
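To make the gap between the average and the percentiles concrete, here is a minimal Python sketch. The sample is invented for illustration (97 fast requests and 3 slow ones), and `percentile` is a hypothetical nearest-rank helper, not something taken from YCSB or HBase.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented latencies in ms: 97 fast requests and 3 slow ones.
latencies_ms = [1.0] * 97 + [100.0] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.2f} p50={p50} p95={p95} p99={p99}")
```

In this sample the mean is almost four times the median, and the 99th percentile is a hundred times the median, which is why a single average says little about what clients actually experience.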
  4. 4. Measure latency • YCSB - Yahoo! Cloud Serving Benchmark • Useful for comparison between databases • Set of predefined workloads • bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation • More options related to HBase: autoflush, replicas, … • Latency measured in microseconds • Easier for internal analysis
  5. 5. Why is it important Durability Availability Consistency
  6. 6. Durability [Figure: durability levels numbered 0 to 3 across Client Buffer, Server Buffer, OS Buffer, and Disk; labels: HBase, BigTable on GFS, Traditional DB Engine]
  7. 7. Durability
  8. 8. Consistency • Two processes: P1, P2 • Counter updated by P1: v1, then v2, then v3 • Eventual consistency allows P1 and P2 to see these events in any order • Strong consistency allows only one order • Google F1 paper, VLDB (2013): « We store financial data and have hard requirements on data integrity and consistency. We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level. »
  9. 9. Consistency • BigTable design: allows consistency by partitioning the data: each machine serves a subset of the data.
  10. 10. Availability • Contract is: « a client outside the cluster will see HBase as available even if there are partitions or failures within the HBase cluster » • There is a lot more to say, but it’s outside the scope of this talk (unfortunately)
  11. 11. Availability • A partition or a machine failure appears to the client as a latency spike
  12. 12. Trade off • Maximizing the benefits while minimizing the cost • Implementation details count • Configuration counts
  13. 13. Write path • Two parts • Single put (WAL) • The client just sends the put • Multiple puts from the client (new behavior since 0.96) • The client is much smarter • Four stages to look at for latency • Start (establish TCP connections, etc.) • Steady: when expected conditions are met • Machine failure: expected as well • Overloaded system
  14. 14. Single put: communication & scheduling • Client: TCP connection to the server • Shared: multiple threads on the same client use the same TCP connection • Pooling is possible and does improve performance in some circumstances • hbase.client.ipc.pool.size • Server: multiple calls from multiple threads on multiple machines • Can become thousands of simultaneous queries • Scheduling is required
  15. 15. Single put: real work • The server must • Write into the WAL queue • Sync the WAL queue (HDFS flush) • Write into the memstore • The WAL queue is shared between all the regions/handlers • Sync is avoided if another handler already did the work • You may flush more than expected
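The idea that a sync is skipped when another handler already did the work can be sketched with a toy model in Python. This is not HBase code: `WalSketch`, its fields, and the sequence-id bookkeeping are assumptions made for illustration, and the model is single-threaded to keep it short.

```python
class WalSketch:
    """Toy model of a shared WAL queue with sync coalescing (not HBase code)."""

    def __init__(self):
        self.queue = []        # edits appended by all handlers
        self.synced_seq = 0    # highest sequence id already flushed
        self.sync_calls = 0    # number of real (expensive) flushes

    def append(self, edit):
        """Queue an edit and return its sequence id (1-based position)."""
        self.queue.append(edit)
        return len(self.queue)

    def sync(self, seq_id):
        """Make seq_id durable; return True only if a real flush was needed."""
        if seq_id <= self.synced_seq:
            return False  # another handler's flush already covered this edit
        self.sync_calls += 1
        # One flush persists everything queued so far, including edits
        # from other handlers: this is how you flush more than expected.
        self.synced_seq = len(self.queue)
        return True


wal = WalSketch()
seq_a = wal.append("put row-a")
seq_b = wal.append("put row-b")
seq_c = wal.append("put row-c")

flushed_b = wal.sync(seq_b)  # real flush, also covers row-a and row-c
flushed_a = wal.sync(seq_a)  # skipped
flushed_c = wal.sync(seq_c)  # skipped
print(wal.sync_calls)
```

Three puts end up paying for a single flush, at the price of that flush carrying edits the calling handler did not write.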
  16. 16. Simple put: A small run • Time in ms by percentile: Mean = 1.21; 50% = 0.95; 95% = 1.50; 99% = 2.12
  17. 17. Latency sources • Candidate one: network • 0.5ms within a datacenter • Much less between nodes in the same rack • Time in ms by percentile: Mean = 0.13; 50% = 0.12; 95% = 0.15; 99% = 0.47
  18. 18. Latency sources • Candidate two: HDFS Flush • Time in ms by percentile: Mean = 0.33; 50% = 0.26; 95% = 0.59; 99% = 1.24 • We can still do better: HADOOP-7714 & sons.
  19. 19. Latency sources • Millisecond world: everything can go wrong • JVM • Network • OS Scheduler • File System • All this goes into the tail beyond the 99th percentile • Requires monitoring • Using the latest version usually helps.
  20. 20. Latency sources • Splits (and presplits) • Autosharding is great! • Puts have to wait • Impacts: seconds • Balance • Regions move • Triggers a retry for the client • hbase.client.pause = 100ms since HBase 0.96 • Garbage Collection • Impacts: 10’s of ms, even with a good config • Covered in the read path part of this talk
  21. 21. From steady to loaded and overloaded • Number of concurrent tasks is a function of • Number of cores • Number of disks • Number of remote machines used • Difficult to estimate • Queues are bound to happen • hbase.regionserver.handler.count • So for low latency • Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code. • RPC Priorities: since 0.98 (HBASE-11048)
  22. 22. From loaded to overloaded • MemStore takes too much room: flush, then block quite quickly • hbase.hregion.memstore.block.multiplier • Too many HFiles: block until compactions keep up • hbase.hstore.blockingStoreFiles • Too many WAL files: flush and block • hbase.regionserver.maxlogs
  23. 23. Machine failure • Failure • Detect • Reallocate • Replay WAL • Replaying the WAL is NOT required for puts • hbase.master.distributed.log.replay • (default true in 1.0) • Failure = Detect + Reallocate + Retry • That’s in the range of ~1s for simple failures • Silent failures put you in the 10s range if the hardware does not help • zookeeper.session.timeout
  24. 24. Single puts • Millisecond range • Spikes do happen in steady mode • 100ms • Causes: GC, load, splits
  25. 25. Streaming puts • HTable#setAutoFlushTo(false) • HTable#put • HTable#flushCommits • Like single puts, but • Puts are grouped and sent in the background • Load is taken into account • Does not block
  26. 26. Multiple puts • hbase.client.max.total.tasks (default 100) • hbase.client.max.perserver.tasks (default 5) • hbase.client.max.perregion.tasks (default 1) • Decouples the client from a latency spike of a region server • Increases throughput by 50% compared to the old multiput • Makes splits and GC more transparent
  27. 27. Conclusion on write path • Single puts can be very fast • It’s not a « hard real time » system: there are spikes • Most latency spikes can be hidden when streaming puts • Failures are NOT that difficult for the write path • No WAL to replay
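How these limits decouple the client from one slow region server can be sketched as a small admission check in Python. This is a toy model, not the HBase client: `TaskLimiter` and its methods are hypothetical, and treating the default of 100 as a cap on total in-flight tasks is an assumption.

```python
from collections import Counter


class TaskLimiter:
    """Toy admission check for in-flight client tasks (not the HBase client)."""

    def __init__(self, max_total=100, max_per_server=5, max_per_region=1):
        self.max_total = max_total
        self.max_per_server = max_per_server
        self.max_per_region = max_per_region
        self.total = 0
        self.per_server = Counter()
        self.per_region = Counter()

    def try_start(self, server, region):
        """Return True and record the task if every limit still has room."""
        if (self.total >= self.max_total
                or self.per_server[server] >= self.max_per_server
                or self.per_region[region] >= self.max_per_region):
            return False
        self.total += 1
        self.per_server[server] += 1
        self.per_region[region] += 1
        return True

    def finish(self, server, region):
        """Release the slots held by a completed task."""
        self.total -= 1
        self.per_server[server] -= 1
        self.per_region[region] -= 1


limiter = TaskLimiter()
# A slow server, rs1, accepts tasks for seven different regions but never answers.
started_on_rs1 = [limiter.try_start("rs1", f"region-{i}") for i in range(7)]
# A healthy server is still usable even though rs1 is saturated.
started_on_rs2 = limiter.try_start("rs2", "region-x")
# A second task for a region that already has one in flight must wait.
second_on_same_region = limiter.try_start("rs2", "region-x")
print(started_on_rs1, started_on_rs2, second_on_same_region)
```

Only five tasks pile up behind the slow server, so the remaining budget stays available for servers that are responding normally.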
  28. 28. And now for the read path
  29. 29. Read path • Gets and short scans are assumed for low-latency operations • Again, two APIs • Single get: HTable#get(Get) • Multi-get: HTable#get(List<Get>) • Four stages, same as write path • Start (TCP connection, …) • Steady: when expected conditions are met • Machine failure: expected as well • Overloaded system: you may need to add machines or tune your workload
  30. 30. Multi get / Client • Group Gets by RegionServer • Execute them one by one
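Grouping Gets by RegionServer can be sketched in Python as a lookup over sorted region start keys. The region layout, server names, and helper functions below are all hypothetical; the real client resolves region locations through its own metadata lookups.

```python
import bisect
from collections import defaultdict

# Hypothetical layout: sorted region start keys and the server hosting each region.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]


def server_for(row_key):
    """Find the region whose start key is the largest one <= row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]


def group_by_server(row_keys):
    """Bucket row keys by hosting server so each server gets a single batch."""
    groups = defaultdict(list)
    for key in row_keys:
        groups[server_for(key)].append(key)
    return dict(groups)


batches = group_by_server(["apple", "kiwi", "pear", "zebra", "grape"])
print(batches)
```

Here five Gets turn into three batches, one per server, because two regions live on `rs1` and two of the keys fall in the same region on `rs2`.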
  31. 31. Multi get / Server
  32. 32. Multi get / Server
  33. 33. Storage hierarchy: a different view • Access latency magnitudes (Dean/2009) • Memory is 100000x faster than disk! • Disk seek = 10ms
  34. 34. Known unknowns • For each candidate HFile • Exclude by file metadata • Timestamp • Rowkey range • Exclude by bloom filter • StoreFileScanner#shouldUseScanner()
  35. 35. Unknown knowns • Merge sort results polled from Stores • Seek each scanner to a reference KeyValue • Retrieve candidate data from disk • Multiple HFiles => multiple seeks • Short Circuit Reads • Block locality • Happy clusters compact! • HFileBlock#readBlockData()
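The exclusion steps on this slide can be sketched in Python as a chain of cheap checks that run before any data block is read. `HFileMeta` and `should_use` are hypothetical names, and a plain set stands in for the bloom filter; a real bloom filter can report false positives, which this stand-in cannot.

```python
from dataclasses import dataclass


@dataclass
class HFileMeta:
    """Hypothetical per-file metadata kept in memory."""
    name: str
    min_ts: int
    max_ts: int
    first_row: str
    last_row: str
    bloom: frozenset  # stand-in for a bloom filter over row keys


def should_use(hfile, row, ts_min, ts_max):
    """Return False as soon as one cheap check proves the file is irrelevant."""
    if hfile.max_ts < ts_min or hfile.min_ts > ts_max:
        return False  # excluded by timestamp range
    if row < hfile.first_row or row > hfile.last_row:
        return False  # excluded by rowkey range
    return row in hfile.bloom  # excluded when the filter says "definitely absent"


files = [
    HFileMeta("old", 0, 99, "a", "z", frozenset({"m"})),
    HFileMeta("other-range", 100, 200, "a", "f", frozenset({"b"})),
    HFileMeta("no-such-row", 100, 200, "a", "z", frozenset({"k"})),
    HFileMeta("match", 100, 200, "a", "z", frozenset({"m"})),
]

candidates = [f.name for f in files if should_use(f, "m", 100, 200)]
print(candidates)
```

Each file excluded here is a disk seek avoided, which is where most of the read-path latency would otherwise go.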
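Merge sorting results polled from several stores can be sketched in Python with `heapq.merge`, which lazily merges inputs that are already sorted. The cell layout (row, negated timestamp, value) and the sample data are assumptions chosen so that the newest cell of a row sorts first; this is not the HBase KeyValue format.

```python
import heapq

# Hypothetical sorted sources. Cells are (row, -timestamp, value), so for a
# given row the newest cell comes first in sort order.
hfile_1 = [("row1", -30, "a3"), ("row3", -10, "c1")]
hfile_2 = [("row1", -20, "a2"), ("row2", -15, "b1")]
memstore = [("row2", -40, "b2")]


def merged_scan(*sources):
    """Merge already-sorted sources into one sorted stream of cells."""
    return list(heapq.merge(*sources))


def latest_per_row(*sources):
    """Keep the first (newest) cell seen for each row in the merged stream."""
    latest = {}
    for row, _neg_ts, value in heapq.merge(*sources):
        latest.setdefault(row, value)
    return latest


cells = merged_scan(hfile_1, hfile_2, memstore)
newest = latest_per_row(hfile_1, hfile_2, memstore)
print(newest)
```

Every extra source in the merge is another scanner to seek, which is why fewer HFiles after compaction means fewer seeks per read.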
  36. 36. BlockCache • Reuse previously read data • Maximize cache hit rate • Larger cache • Temporal access locality • Physical access locality BlockCache#getBlock()
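The effect of cache size and temporal locality on hit rate can be sketched with a small LRU cache in Python. `LruBlockCacheSketch` is a hypothetical toy built on `OrderedDict`, not the HBase LruBlockCache, and the access pattern is invented.

```python
from collections import OrderedDict


class LruBlockCacheSketch:
    """Toy least-recently-used block cache (not the HBase implementation)."""

    def __init__(self, capacity, load_block):
        self.capacity = capacity
        self.load_block = load_block  # called on a miss, e.g. a disk read
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_block(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.blocks[key]
        self.misses += 1
        block = self.load_block(key)
        self.blocks[key] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used
        return block

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def run(capacity, accesses):
    """Replay an access pattern and return the resulting hit rate."""
    cache = LruBlockCacheSketch(capacity, load_block=lambda key: f"data-{key}")
    for key in accesses:
        cache.get_block(key)
    return cache.hit_rate()


pattern = ["a", "b", "a", "c", "b"]
small_cache_rate = run(2, pattern)
large_cache_rate = run(3, pattern)
print(small_cache_rate, large_cache_rate)
```

With room for two blocks, block `b` is evicted just before it is needed again; with room for three it stays resident and the hit rate doubles on this pattern.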
  37. 37. BlockCache Showdown • LruBlockCache • Default, onheap • Quite good most of the time • Evictions impact GC • BucketCache • Offheap alternative • Serialization overhead • Large memory configurations cache-showdown/ L2 off-heap BucketCache makes a strong showing
  38. 38. Latency enemies: Garbage Collection • Use heap. Not too much. With CMS. • Max heap • 30GB (compressed pointers) • 8-16GB if you care about 9’s • Healthy cluster load • Regular, reliable collections • 25-100ms pause at regular intervals • An overloaded RegionServer suffers far more from GC
  39. 39. Off-heap to the rescue? • BucketCache (0.96, HBASE-7404) • Network interfaces (HBASE-9535) • MemStore et al (HBASE-10191)
  40. 40. Latency enemies: Compactions • Fewer HFiles => fewer seeks • Evict data blocks! • Evict Index blocks!! • hfile.block.index.cacheonwrite • Evict bloom blocks!!! • hfile.block.bloom.cacheonwrite • OS buffer cache to the rescue • Compacted data is still fresh • Better than going all the way back to disk
  41. 41. Failure • Detect + Reassign + Replay • Strong consistency requires replay • Locality drops to 0 • Cache starts from scratch
  42. 42. Hedging our bets • HDFS Hedged reads (2.4, HDFS-5776) • Reads on secondary DataNodes • Strongly consistent • Works at the HDFS level • Timeline consistency (HBASE-10070) • Reads on « Replica Region » • Not strongly consistent
  43. 43. Read latency in summary • Steady mode • Cache hit: < 1 ms • Cache miss: + 10 ms per seek • Writing while reading => cache churn • GC: 25-100ms pause at regular intervals • Estimate: network request + (1 - P(cache hit)) * (10 ms * seeks) • Same long tail issues as write • Overloaded: same scheduling issues as write • Partial failures hurt a lot
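The steady-mode estimate on this slide, network request + (1 - P(cache hit)) * (10 ms * seeks), can be written as a small Python function. The function name and the sample inputs are assumptions chosen for illustration; the 0.5 ms network figure reuses the datacenter number quoted earlier in the talk.

```python
def expected_read_latency_ms(network_ms, cache_hit_probability, seeks, seek_ms=10.0):
    """Expected read latency: network cost plus the average cost of cache misses."""
    miss_probability = 1.0 - cache_hit_probability
    return network_ms + miss_probability * (seek_ms * seeks)


# Assumed inputs: 0.5 ms network round trip, 75% cache hit rate, 2 seeks per miss.
estimate = expected_read_latency_ms(0.5, 0.75, 2)
all_hits = expected_read_latency_ms(0.5, 1.0, 2)
all_misses = expected_read_latency_ms(0.5, 0.0, 2)
print(estimate, all_hits, all_misses)
```

The spread between the all-hits and all-misses cases shows why the cache hit rate and the number of HFiles to seek dominate read latency in steady mode.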
  44. 44. HBase ranges for 99% latency • Steady: Put = milliseconds; Streamed multiput = milliseconds; Get = milliseconds; Timeline get = milliseconds • Failure: Put = seconds; Streamed multiput = seconds; Get = seconds; Timeline get = milliseconds • GC: Put = 10’s of milliseconds; Streamed multiput = milliseconds; Get = 10’s of milliseconds; Timeline get = milliseconds
  45. 45. What’s next • Less GC • Use fewer objects • Offheap • ✓ Compressed BlockCache (HBASE-11331) • Preferred location (HBASE-4755) • The « magical 1% » • Most tools stop at the 99% latency • What happens after is much more complex
  46. 46. Performance with Compressed BlockCache • Total RAM: 24G • LruBlockCache Size: 12G • Data Size: 45G • Compressed Size: 11G • Compression: SNAPPY • [Figure: bar chart of times improvement (0.0x to 35.0x), Enabled vs Disabled, by metric: throughput (ops/sec), latency (ms, p95), latency (ms, p99), cpu load]
  47. 47. Thanks! Nick Dimiduk, Hortonworks (@xefyr) Nicolas Liochon, Scaled Risk (@nkeywal) Strata New York, October 17, 2014