HBase Low Latency
  • Random read/write database; latency is very important.
  • Microseconds? Seconds? We are talking milliseconds.
    Everyone stops looking after the 99th percentile; the literature calls the rest the “magical 1%”.
  • How to measure latency in HBase.
  • The client connects to the region server over a TCP connection.
    The connection is shared by client threads.
    The server manages lots of client connections.
    It schedules client queries: synchronization, locks, queues, etc.
  • The WAL queue is shared between regions.
    - Sometimes the sync work has already been done for you, which can help.
    - Sometimes your small edit is synced together with a larger one, which can hurt.
  • Small cluster, 4-year-old machines.
    1 put, 1 put, 1 put…
    99% is double the mean.
    Servers are doing other work, but nothing major.
  • Where do we spend time? How about the network?
    0.5 ms is conservative; usually much less.
    TCP round trip, same cluster as the previous slide.
    99% is 4x the mean.
  • Where do we spend time? How about HDFS?
    Flushing, writing, etc.
    Just HDFS (2.4), 1 KB, flush, flush, flush.
    99% is 5x the mean; not bad.
  • What else?
    Millisecond world means minor things start to matter:
    - JVM 1.7 is better at blocking queues than 1.6
    - forgot to configure Nagle? 40 ms
    - older Linux scheduler bugs: 50 ms
    - Facebook literature talks a lot about filesystems
  • Regular cluster operations also hurt.
    Region split: single-digit seconds.
    Region move: 100 ms (better on 0.96).
    GC: 10’s of ms, even after configuration tuning.
  • Now add some load.
    Load == concurrency == queues: 1 ms.
    0.98 adds a pluggable scheduler, so you can influence these queues yourself.
    Work is in progress to expose this through configuration.
  • Contract: save you from cluster explosion.
    How to protect? Stop writing.
    - Too many WALs? Stop writes.
    - Too many HFiles to compact? Stop writes.
    Lasts indefinitely. New default settings.
  • Puts do not require reads, so WAL replay doesn’t block writes.
    Simple crash: quickly detected, 1 s.
    Hung/frozen:
    - conservative detection takes longer
    - configured to 5-15 s, but looks like a long GC pause
  • New since 0.96
  • Settings for an average cluster and an unaggressive client.
    Decouples the client from issues on a single RS.
    YCSB, single empty region: 50% better throughput (HBASE-6295).
  • This talk: assume get/short scan implies low-latency requirements.
    No fancy streaming client as for Puts; you wait for the slowest RS.
  • Gets are grouped by RS, groups are sent in parallel, and the client blocks until all groups return.
    Network call: 10’s of microseconds, in parallel.
  • The read path in full detail is quite complex.
    See Lars’ HBaseCon 2012 talk for the nitty-gritty.
    The complexity comes from optimizing around one invariant: the 10 ms seek.
  • The complexity comes from optimizing around one invariant: the 10 ms seek.
    Aiming for a 100-microsecond world; how to work around this?
    Goal: avoid disk at all costs!
  • Goal: don’t go to disk unless absolutely necessary.
    Tactic: candidate HFile elimination.
    Regular compactions => 3-5 candidates.
    Bloom filters are on by default in 0.96+.
  • Merge sort over multiple files means multiple seeks.
    More spindles = parallel scanning.
    SCR (short-circuit reads) avoids the proxy process (DataNode).
    But remember! Goal: don’t go to disk unless absolutely necessary.
  • A “block” is a segment of an HFile.
    Data blocks, index blocks, and bloom blocks.
    Blocks that are read are retained in the BlockCache.
    Seeks to the same and adjacent data become cache hits.
  • HBase ships with a variety of cache implementations.
    Happy with 95% stats? Stick with LruBlockCache and a modest heap size.
    Pushing 99%? LRU is still okay, but watch that heap size.
    Spent money on RAM? BucketCache.
  • GC is a part of healthy operation.
    BlockCache garbage is of awkward size and age, which means pauses.
    Pause time is a function of heap size.
    More like ~16GiB if you’re really worried about 99%.
    Overloaded: more cache evictions, more time in GC.
  • Why generate garbage at all?
    GCs are smart, but maybe we know our pain spots better?
    Don’t know until we try.
  • Compactions are necessary for fewer scan candidates and fewer seeks.
    Buffer cache to the rescue.
    “That’s steady state and overloaded; let’s talk about failure.”
  • Replaying WALs takes time.
    Unlucky: no data locality, talking to a remote DataNode.
    Empty cache.
    “Failure isn’t binary. What about the sick and the dying?”
  • Don’t wait for a slow machine!
  • Reads are dominated by disk seeks, so keep that data in memory.
    After a cache miss, GC is the next candidate cause of latency.
    “Ideal formula”
    P(cache hit): fn(cache size :: db size, request locality)
    Sometimes the jitter dominates.
  • Standard deployment, well-designed schema.
    Millisecond responses, seconds for failure recovery, and GC at a regular interval.
    Everything we’ve focused on here affects the 99%.
    Beyond that there are a lot of interesting problems to solve.
  • There’s always more work to be done.
    Generate less garbage.
    Compressed BlockCache.
    Improve recovery time and locality.
  • Questions!

HBase Low Latency: Presentation Transcript

  • HBase Low Latency
    Nick Dimiduk, Hortonworks (@xefyr)
    Nicolas Liochon, Scaled Risk (@nkeywal)
    Hadoop Summit, June 4, 2014
  • Agenda
    - Latency: what it is, how to measure it
    - Write path
    - Read path
    - Next steps
  • What’s low latency
    - Latency is about percentiles
      - Average != 50th percentile
      - There are often orders of magnitude between the « average » and the « 95th percentile »
      - Post 99% = « magical 1% ». Work in progress here.
    - Meaning ranges from microseconds (High Frequency Trading) to seconds (interactive queries)
    - In this talk: milliseconds
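The point that the average is not the 50th percentile can be checked with a few lines of Java. This is a minimal sketch, not part of the talk: the class name, the nearest-rank percentile definition, and the sample values are all assumptions chosen for illustration.

```java
import java.util.Arrays;

final class LatencyPercentiles {
    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static double percentile(double[] samplesMs, double p) {
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean of the samples. */
    static double mean(double[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Hypothetical samples (ms): nine fast calls and one slow outlier.
        double[] samples = {1, 1, 1, 1, 1, 1, 1, 1, 1, 100};
        System.out.println("mean = " + mean(samples));
        System.out.println("p50  = " + percentile(samples, 50));
        System.out.println("p99  = " + percentile(samples, 99));
    }
}
```

With a single outlier the mean lands far above the median, which is why the slides reason in percentiles rather than averages.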
  • Measure latency
    - bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
      - More options related to HBase: autoflush, replicas, …
      - Latency measured in microseconds
      - Easier for internal analysis
    - YCSB (Yahoo! Cloud Serving Benchmark)
      - Useful for comparison between databases
      - Set of workloads already defined
  • Write path
    - Two parts
      - Single put (WAL): the client just sends the put
      - Multiple puts from the client (new behavior since 0.96): the client is much smarter
    - Four stages to look at for latency
      - Start (establish TCP connections, etc.)
      - Steady: when expected conditions are met
      - Machine failure: expected as well
      - Overloaded system
  • Single put: communication & scheduling
    - Client: TCP connection to the server
      - Shared: multiple threads on the same client use the same TCP connection
      - Pooling is possible and does improve performance in some circumstances
      - hbase.client.ipc.pool.size
    - Server: multiple calls from multiple threads on multiple machines
      - Can become thousands of simultaneous queries
      - Scheduling is required
  • Single put: real work
    - The server must
      - Write into the WAL queue
      - Sync the WAL queue (HDFS flush)
      - Write into the memstore
    - The WAL queue is shared between all the regions/handlers
      - Sync is avoided if another handler did the work
      - Your handler may flush more data than expected
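The shared-queue behavior on this slide (a sync is skipped when another handler already covered the edit, and one sync may flush other handlers' data) can be sketched as follows. This is a simplified, hypothetical model: the class and its sequence-id bookkeeping are assumptions for illustration and are not HBase's WAL implementation.

```java
final class GroupSyncWal {
    private long lastAppended = 0; // highest sequence id written to the shared queue
    private long lastSynced = 0;   // highest sequence id known to be durable
    private int syncCalls = 0;     // number of simulated HDFS flushes issued

    /** Append an edit to the shared queue and return its sequence id (payload ignored here). */
    synchronized long append(String edit) {
        return ++lastAppended;
    }

    /** Ensure the edit with this sequence id is durable. */
    synchronized void sync(long seqId) {
        if (seqId <= lastSynced) {
            return; // another handler's sync already covered this edit
        }
        // One flush covers everything appended so far, including other handlers' edits.
        syncCalls++;
        lastSynced = lastAppended;
    }

    synchronized int syncCalls() {
        return syncCalls;
    }

    public static void main(String[] args) {
        GroupSyncWal wal = new GroupSyncWal();
        long small = wal.append("small edit");
        long large = wal.append("large edit");
        wal.sync(large); // flushes both edits
        wal.sync(small); // no extra flush needed
        System.out.println("flushes issued: " + wal.syncCalls());
    }
}
```

The same mechanism explains both effects in the notes: the second handler gets its sync for free, while the first one pays for flushing the other handler's larger edit.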
  • Simple put: a small run
    Percentile   Time in ms
    Mean         1.21
    50%          0.95
    95%          1.50
    99%          2.12
  • Latency sources
    - Candidate one: network
      - 0.5ms within a datacenter
      - Much less between nodes in the same rack
    Percentile   Time in ms
    Mean         0.13
    50%          0.12
    95%          0.15
    99%          0.47
  • Latency sources
    - Candidate two: HDFS Flush
      - We can still do better: HADOOP-7714 & sons.
    Percentile   Time in ms
    Mean         0.33
    50%          0.26
    95%          0.59
    99%          1.24
  • Latency sources
    - Millisecond world: everything can go wrong
      - JVM
      - Network
      - OS scheduler
      - File system
    - All this goes into the post-99% percentile
    - Requires monitoring
    - Usually using the latest version helps
  • Latency sources
    - Split (and presplits)
      - Autosharding is great!
      - Puts have to wait
      - Impact: seconds
    - Balance
      - Regions move
      - Triggers a retry for the client
      - hbase.client.pause = 100ms since HBase 0.96
    - Garbage Collection
      - Impact: 10’s of ms, even with a good config
      - Covered in the read path part of this talk
  • From steady to loaded and overloaded
    - Number of concurrent tasks is a factor of
      - Number of cores
      - Number of disks
      - Number of remote machines used
    - Difficult to estimate
    - Queues are doomed to happen
      - hbase.regionserver.handler.count
    - So for low latency
      - Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
      - RPC priorities: work in progress (HBASE-11048)
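Once queues form, the order in which handlers pick up calls decides who waits. The sketch below shows the general idea of priority-based scheduling with a plain Java queue. It is an assumption-laden illustration: the class, the priority values, and the call names are hypothetical, and it does not use or reproduce HBase's scheduler interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

final class PriorityCallQueue {
    /** A queued call; higher priority is served first, ties are served in arrival order. */
    static final class Call implements Comparable<Call> {
        final String name;
        final int priority;
        final long arrival;

        Call(String name, int priority, long arrival) {
            this.name = name;
            this.priority = priority;
            this.arrival = arrival;
        }

        @Override
        public int compareTo(Call other) {
            int byPriority = Integer.compare(other.priority, priority);
            return byPriority != 0 ? byPriority : Long.compare(arrival, other.arrival);
        }
    }

    private final PriorityBlockingQueue<Call> queue = new PriorityBlockingQueue<>();
    private long arrivals = 0;

    synchronized void submit(String name, int priority) {
        queue.add(new Call(name, priority, arrivals++));
    }

    /** Returns call names in the order a single handler would serve them. */
    List<String> drain() {
        List<String> order = new ArrayList<>();
        Call next;
        while ((next = queue.poll()) != null) {
            order.add(next.name);
        }
        return order;
    }

    public static void main(String[] args) {
        PriorityCallQueue q = new PriorityCallQueue();
        q.submit("scan-1", 0);  // hypothetical low-priority long scan
        q.submit("get-1", 10);  // hypothetical latency-sensitive get
        System.out.println(q.drain());
    }
}
```

In this model a latency-sensitive call that arrives behind a long scan no longer waits for it, at the cost of possibly starving low-priority work under sustained load.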
  • From loaded to overloaded
    - MemStore takes too much room: flush, then blocks quite quickly
      - hbase.regionserver.global.memstore.size.lower.limit
      - hbase.regionserver.global.memstore.size
      - hbase.hregion.memstore.block.multiplier
    - Too many HFiles: block until compactions keep up
      - hbase.hstore.blockingStoreFiles
    - Too many WAL files: flush and block
      - hbase.regionserver.maxlogs
  • Machine failure
    - Failure
      - Detect
      - Reallocate
      - Replay WAL
    - Replaying the WAL is NOT required for puts
      - hbase.master.distributed.log.replay (default true in 1.0)
    - Failure = Detect + Reallocate + Retry
      - That’s in the range of ~1s for simple failures
      - Silent failures put you in the 10s range if the hardware does not help
      - zookeeper.session.timeout
  • Single puts
    - Millisecond range
    - Spikes do happen in steady mode
      - 100ms
      - Causes: GC, load, splits
  • Streaming puts
    HTable#setAutoFlushTo(false)
    HTable#put
    HTable#flushCommits
    - Like simple puts, but
      - Puts are grouped and sent in the background
      - Load is taken into account
      - Does not block
  • Multiple puts
    hbase.client.max.total.tasks (default 100)
    hbase.client.max.perserver.tasks (default 5)
    hbase.client.max.perregion.tasks (default 1)
    - Decouples the client from a latency spike of a region server
    - Increases the throughput by 50% compared to the old multiput
    - Makes splits and GC more transparent
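The task limits above are what keep one slow region server from absorbing all of the client's in-flight work. A minimal sketch of that idea, assuming a simple counter-based limiter: the class and method names are hypothetical and this is not the HBase client's code, only the limit values' meaning is taken from the slide.

```java
import java.util.HashMap;
import java.util.Map;

final class TaskLimiter {
    private final int maxTotal;
    private final int maxPerServer;
    private final Map<String, Integer> runningPerServer = new HashMap<>();
    private int runningTotal = 0;

    TaskLimiter(int maxTotal, int maxPerServer) {
        this.maxTotal = maxTotal;
        this.maxPerServer = maxPerServer;
    }

    /** Returns true and counts the task if both limits allow it; otherwise leaves state unchanged. */
    synchronized boolean tryStart(String server) {
        int running = runningPerServer.getOrDefault(server, 0);
        if (runningTotal >= maxTotal || running >= maxPerServer) {
            return false; // keep the edits buffered and try again later
        }
        runningPerServer.put(server, running + 1);
        runningTotal++;
        return true;
    }

    /** Marks one task for this server as finished. */
    synchronized void finish(String server) {
        int running = runningPerServer.getOrDefault(server, 0);
        if (running == 0) {
            return; // nothing in flight for this server
        }
        runningPerServer.put(server, running - 1);
        runningTotal--;
    }

    public static void main(String[] args) {
        TaskLimiter limiter = new TaskLimiter(100, 2);
        System.out.println(limiter.tryStart("rs1"));
        System.out.println(limiter.tryStart("rs1"));
        System.out.println(limiter.tryStart("rs1")); // rs1 is at its limit
        System.out.println(limiter.tryStart("rs2")); // other servers are unaffected
    }
}
```

A stalled server simply stops receiving new tasks once it reaches its per-server limit, while puts destined for healthy servers keep flowing.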
  • Conclusion on write path
    - Single puts can be very fast
      - It’s not a « hard real time » system: there are spikes
    - Most latency spikes can be hidden when streaming puts
    - Failures are NOT that difficult for the write path
      - No WAL to replay
  • And now for the read path
  • Read path
    - Get/short scan are assumed for low-latency operations
    - Again, two APIs
      - Single get: HTable#get(Get)
      - Multi-get: HTable#get(List<Get>)
    - Four stages, same as write path
      - Start (TCP connection, …)
      - Steady: when expected conditions are met
      - Machine failure: expected as well
      - Overloaded system: you may need to add machines or tune your workload
  • Multi get / Client
    - Group Gets by RegionServer
    - Execute them one by one
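The client-side grouping can be sketched with plain Java collections. This sketch follows the speaker notes (groups sent in parallel, the caller blocks until every group returns). The `locate` and `serverGet` functions are hypothetical stand-ins for region lookup and the remote call, and none of this is the HBase client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class MultiGetSketch {
    /** Groups row keys by the server that hosts them, as reported by locate. */
    static Map<String, List<String>> groupByServer(List<String> rows, Function<String, String> locate) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String row : rows) {
            groups.computeIfAbsent(locate.apply(row), server -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    /** Sends one request per server in parallel and blocks until all groups have returned. */
    static Map<String, String> multiGet(List<String> rows,
                                        Function<String, String> locate,
                                        Function<String, String> serverGet) {
        Map<String, List<String>> groups = groupByServer(rows, locate);
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, groups.size()));
        try {
            List<Callable<Map<String, String>>> calls = new ArrayList<>();
            for (List<String> group : groups.values()) {
                calls.add(() -> {
                    // Within a group, rows are fetched one by one.
                    Map<String, String> partial = new HashMap<>();
                    for (String row : group) {
                        partial.put(row, serverGet.apply(row));
                    }
                    return partial;
                });
            }
            Map<String, String> results = new TreeMap<>();
            for (Future<Map<String, String>> done : pool.invokeAll(calls)) {
                results.putAll(done.get()); // overall latency is set by the slowest group
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a1", "b1", "a2");
        System.out.println(multiGet(rows, row -> "rs-" + row.charAt(0), String::toUpperCase));
    }
}
```

Because the caller waits for every group, a single slow region server determines the latency of the whole multi-get.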
  • Multi get / Server
  • Multi get / Server
  • Access latency magnitudes
    Storage hierarchy: a different view (Dean/2009)
    - Memory is 100000x faster than disk!
    - Disk seek = 10ms
  • Known unknowns
    - For each candidate HFile
      - Exclude by file metadata
        - Timestamp
        - Rowkey range
      - Exclude by bloom filter
    StoreFileScanner#shouldUseScanner()
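The three exclusion checks on this slide can be sketched as a simple filter. Everything here is a hypothetical model: `FileMeta` and its fields are invented for illustration, and the bloom filter is replaced by an exact set (a real bloom filter can return false positives, never false negatives). It does not reproduce the logic of StoreFileScanner#shouldUseScanner().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class HFileCandidateFilter {
    /** Hypothetical per-file metadata; field names are illustrative, not HBase's. */
    static final class FileMeta {
        final String name;
        final String firstRow;
        final String lastRow;
        final long minTs;
        final long maxTs;
        final Set<String> bloom; // exact set standing in for a bloom filter

        FileMeta(String name, String firstRow, String lastRow, long minTs, long maxTs, String... rows) {
            this.name = name;
            this.firstRow = firstRow;
            this.lastRow = lastRow;
            this.minTs = minTs;
            this.maxTs = maxTs;
            this.bloom = new HashSet<>(Arrays.asList(rows));
        }
    }

    /** Returns false as soon as any cheap check proves the file cannot hold the row. */
    static boolean shouldUse(FileMeta file, String row, long fromTs, long toTs) {
        if (file.maxTs < fromTs || file.minTs > toTs) {
            return false; // timestamp range does not overlap
        }
        if (row.compareTo(file.firstRow) < 0 || row.compareTo(file.lastRow) > 0) {
            return false; // row key outside the file's key range
        }
        return file.bloom.contains(row); // bloom filter says "definitely not" or "maybe"
    }

    /** Names of the files that still have to be read for this row and time range. */
    static List<String> candidates(List<FileMeta> files, String row, long fromTs, long toTs) {
        List<String> kept = new ArrayList<>();
        for (FileMeta file : files) {
            if (shouldUse(file, row, fromTs, toTs)) {
                kept.add(file.name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<FileMeta> files = Arrays.asList(
            new FileMeta("old", "a", "z", 0, 99, "m"),
            new FileMeta("other-range", "a", "f", 100, 200, "b"),
            new FileMeta("no-bloom-hit", "a", "z", 100, 200, "k"),
            new FileMeta("match", "a", "z", 100, 200, "m"));
        System.out.println(candidates(files, "m", 100, 200));
    }
}
```

Each excluded file is one disk seek avoided, which is the whole purpose of the elimination step.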
  • Unknown knowns
    - Merge sort results polled from Stores
      - Seek each scanner to a reference KeyValue
      - Retrieve candidate data from disk
    - Multiple HFiles => multiple seeks
      - hbase.storescanner.parallel.seek.enable=true
    - Short Circuit Reads
      - dfs.client.read.shortcircuit=true
    - Block locality
      - Happy clusters compact!
    HFileBlock#readBlockData()
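The merge sort over several sorted files is a standard k-way merge. The sketch below shows it with a heap over in-memory lists. It assumes each list stands in for one HFile's sorted keys and is only an illustration of the technique, not HBase's scanner code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class ScannerMerge {
    /** Merges already-sorted key lists into one sorted list using a min-heap of cursors. */
    static List<String> merge(List<List<String>> sortedFiles) {
        // Each heap entry is {file index, position in that file}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> sortedFiles.get(a[0]).get(a[1]).compareTo(sortedFiles.get(b[0]).get(b[1])));
        for (int i = 0; i < sortedFiles.size(); i++) {
            if (!sortedFiles.get(i).isEmpty()) {
                heap.add(new int[] {i, 0}); // one initial "seek" per file
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<String> file = sortedFiles.get(top[0]);
            merged.add(file.get(top[1]));
            if (top[1] + 1 < file.size()) {
                heap.add(new int[] {top[0], top[1] + 1}); // advance that file's cursor
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(
            Arrays.asList("a", "d"),
            Arrays.asList("b", "e"),
            Arrays.asList("c"));
        System.out.println(merge(files));
    }
}
```

Every file in the merge needs its own initial seek, so fewer candidate files directly means fewer seeks.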
  • BlockCache
    - Reuse previously read data
    - Maximize cache hit rate
      - Larger cache
      - Temporal access locality
      - Physical access locality
    BlockCache#getBlock()
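Why temporal locality raises the hit rate is easiest to see in a tiny least-recently-used cache. This sketch builds one on `java.util.LinkedHashMap` in access order. It is a generic LRU illustration with a hypothetical class name and capacity counted in blocks; it is not HBase's LruBlockCache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxBlocks;

    LruCacheSketch(int maxBlocks) {
        super(16, 0.75f, true); // true = iterate in access order, least recently used first
        this.maxBlocks = maxBlocks;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxBlocks; // evict the least recently used block when over capacity
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("block-a", "data a");
        cache.put("block-b", "data b");
        cache.get("block-a");           // touch a, so b becomes the eviction candidate
        cache.put("block-c", "data c"); // evicts b
        System.out.println(cache.keySet());
    }
}
```

Blocks that keep being read stay resident, while blocks read once are the first to go when the cache fills up.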
  • BlockCache Showdown
    - LruBlockCache
      - Default, on-heap
      - Quite good most of the time
      - Evictions impact GC
    - BucketCache
      - Off-heap alternative
      - Serialization overhead
      - Large memory configurations
    http://www.n10k.com/blog/blockcache-showdown/
    L2 off-heap BucketCache makes a strong showing
  • Latency enemies: Garbage Collection
    - Use heap. Not too much. With CMS.
    - Max heap
      - 30GB (compressed pointers)
      - 8-16GB if you care about 9’s
    - Healthy cluster load
      - Regular, reliable collections
      - 25-100ms pause on regular interval
    - Overloaded RegionServer suffers GC overmuch
  • Off-heap to the rescue?
    - BucketCache (0.96, HBASE-7404)
    - Network interfaces (HBASE-9535)
    - MemStore et al (HBASE-10191)
  • Latency enemies: Compactions
    - Fewer HFiles => fewer seeks
    - Evict data blocks!
    - Evict index blocks!!
      - hfile.block.index.cacheonwrite
    - Evict bloom blocks!!!
      - hfile.block.bloom.cacheonwrite
    - OS buffer cache to the rescue
      - Compacted data is still fresh
      - Better than going all the way back to disk
  • Failure
    - Detect + Reassign + Replay
    - Strong consistency requires replay
    - Locality drops to 0
    - Cache starts from scratch
  • Hedging our bets
    - HDFS hedged reads (2.4, HDFS-5776)
      - Reads on secondary DataNodes
      - Strongly consistent
      - Works at the HDFS level
    - Timeline consistency (HBASE-10070)
      - Reads on « Replica Region »
      - Not strongly consistent
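The hedging idea (do not wait for a slow machine: after a threshold, ask a second copy and take whichever answers first) can be sketched with `CompletableFuture`. This is a generic illustration under stated assumptions: the class, the suppliers, and the threshold are hypothetical, and it is neither the HDFS hedged-read implementation nor the HBase replica-read client.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class HedgedReadSketch {
    /** Reads from primary; if it has not answered after hedgeAfterMs, also asks secondary. */
    static <T> T read(Supplier<T> primary, Supplier<T> secondary, long hedgeAfterMs) {
        ExecutorService pool = Executors.newCachedThreadPool(runnable -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true); // do not keep the JVM alive for an abandoned read
            return thread;
        });
        try {
            CompletableFuture<T> first = CompletableFuture.supplyAsync(primary, pool);
            try {
                return first.get(hedgeAfterMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException slowPrimary) {
                // Primary is slow: start the hedge and return whichever read finishes first.
                CompletableFuture<T> second = CompletableFuture.supplyAsync(secondary, pool);
                @SuppressWarnings("unchecked")
                T winner = (T) CompletableFuture.anyOf(first, second).get();
                return winner;
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow(); // interrupt the losing read
        }
    }

    public static void main(String[] args) {
        Supplier<String> slowPrimary = () -> {
            try {
                Thread.sleep(2000); // simulated sick DataNode
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "primary";
        };
        System.out.println(read(slowPrimary, () -> "replica", 50));
    }
}
```

The trade-off is extra load: every hedge is a second request, so the threshold is usually set high enough that only tail-latency reads trigger it.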
  • Read latency in summary
    - Steady mode
      - Cache hit: < 1 ms
      - Cache miss: + 10 ms per seek
      - Writing while reading => cache churn
      - GC: 25-100ms pause on regular interval
      Network request + (1 - P(cache hit)) * (10 ms * seeks)
    - Same long tail issues as write
    - Overloaded: same scheduling issues as write
    - Partial failures hurt a lot
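The steady-mode formula on this slide is easy to evaluate for a few cache hit rates. The sketch below does so in Java. The formula and the 10 ms seek come from the slide, while the class name and the input values in `main` (0.5 ms network, 3 seeks) are assumptions chosen only to show the shape of the curve.

```java
final class ReadLatencyModel {
    /** Network request + (1 - P(cache hit)) * (seek time * seeks), all in milliseconds. */
    static double expectedMs(double networkMs, double cacheHitProbability, int seeks, double seekMs) {
        return networkMs + (1.0 - cacheHitProbability) * (seekMs * seeks);
    }

    public static void main(String[] args) {
        double networkMs = 0.5; // assumed network cost per request
        int seeks = 3;          // assumed number of candidate HFiles
        for (double hit : new double[] {0.5, 0.9, 0.99}) {
            System.out.printf("P(hit)=%.2f -> %.2f ms expected%n",
                hit, expectedMs(networkMs, hit, seeks, 10.0));
        }
    }
}
```

The expected value hides the tail: an individual cache miss still pays the full seek cost, which is why the 99th percentile is dominated by misses and GC pauses.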
  • HBase ranges for 99% latency
               Put                    Streamed Multiput   Get                    Timeline get
    Steady     milliseconds           milliseconds        milliseconds           milliseconds
    Failure    seconds                seconds             seconds                milliseconds
    GC         10’s of milliseconds   milliseconds        10’s of milliseconds   milliseconds
  • What’s next
    - Less GC
      - Use fewer objects
      - Off-heap
    - Compressed BlockCache (HBASE-8894)
    - Preferred location (HBASE-4755)
    - The « magical 1% »
      - Most tools stop at the 99% latency
      - What happens after is much more complex
  • Thanks!
    Nick Dimiduk, Hortonworks (@xefyr)
    Nicolas Liochon, Scaled Risk (@nkeywal)
    Hadoop Summit, June 4, 2014