HBase Low Latency
Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
Hadoop Summit June 4, 2014
Agenda
• Latency, what is it, how to measure it
• Write path
• Read path
• Next steps
What’s low latency
Latency is about percentiles
• Average != 50th percentile
• There are often orders of magnitude between the « average » and the « 95th percentile »
• Post 99% = the « magical 1% ». Work in progress here.
• Meaning anywhere from microseconds (high-frequency trading) to seconds (interactive queries)
• In this talk: milliseconds
Measure latency
bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
• More options related to HBase: autoflush, replicas, …
• Latency measured in microseconds
• Easier for internal analysis
YCSB - Yahoo! Cloud Serving Benchmark
• Useful for comparisons between databases
• A set of workloads is already defined
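As an illustration (not on the slides): a minimal Java sketch that measures single-put latency and reports percentiles the way the tables on the next slides do. It assumes the 0.96-era HTable client API; the table name "t1" and family "f" are placeholders.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutLatencySample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");   // placeholder table, must already exist
    table.setAutoFlushTo(true);              // one RPC per put: the single-put case

    List<Long> micros = new ArrayList<Long>();
    for (int i = 0; i < 10000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
      long start = System.nanoTime();
      table.put(put);                        // blocks until WAL sync + MemStore write
      micros.add((System.nanoTime() - start) / 1000);
    }
    table.close();

    Collections.sort(micros);
    System.out.println("50% (us): " + micros.get(micros.size() / 2));
    System.out.println("95% (us): " + micros.get((int) (micros.size() * 0.95)));
    System.out.println("99% (us): " + micros.get((int) (micros.size() * 0.99)));
  }
}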
Write path
• Two parts
• Single put (WAL)
• The client just sends the put
• Multiple puts from the client (new behavior since 0.96)
• The client is much smarter
• Four stages to look at for latency
• Start (establish TCP connections, etc.)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system
Single put: communication & scheduling
• Client: TCP connection to the server
• Shared: multiple threads on the same client use the same TCP connection
• Pooling is possible and does improve performance in some circumstances
• hbase.client.ipc.pool.size
• Server: multiple calls from multiple threads on multiple machines
• Can become thousands of simultaneous queries
• Scheduling is required
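A minimal sketch of the pooling knob above (not on the slides); the pool size of 4 and the table name "t1" are arbitrary examples, and whether pooling helps depends on how many threads share the client.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ClientConnectionPool {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // By default every thread of this client shares one TCP connection per
    // RegionServer; a small pool can reduce contention for heavily threaded clients.
    conf.setInt("hbase.client.ipc.pool.size", 4);
    HTable table = new HTable(conf, "t1");
    // ... issue puts/gets from many threads against this table ...
    table.close();
  }
}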
Single put: real work
• The server must
• Write into the WAL queue
• Sync the WAL queue (HDFS flush)
• Write into the MemStore
• The WAL queue is shared between all the regions/handlers
• Sync is avoided if another handler already did the work
• Your handler may flush more data than expected
Simple put: A small run
Percentile Time in ms
Mean 1.21
50% 0.95
95% 1.50
99% 2.12
Latency sources
• Candidate one: network
• 0.5ms within a datacenter
• Much less between nodes in the same rack
Percentile Time in ms
Mean 0.13
50% 0.12
95% 0.15
99% 0.47
Latency sources
• Candidate two: HDFS Flush
• We can still do better: HADOOP-7714 & sons.
Percentile Time in ms
Mean 0.33
50% 0.26
95% 0.59
99% 1.24
Latency sources
• Millisecond world: everything can go wrong
• JVM
• Network
• OS Scheduler
• File System
• All this goes into the post-99% percentile
• Requires monitoring
• Usually, using the latest version helps
Latency sources
• Split (and presplits)
• Autosharding is great!
• Puts have to wait
• Impacts: seconds
• Balance
• Regions move
• Triggers a retry on the client
• hbase.client.pause = 100ms since HBase 0.96
• Garbage Collection
• Impacts: 10's of ms, even with a good config
• Covered with the read path in this talk
From steady to loaded and overloaded
• Number of concurrent tasks is a factor of
• Number of cores
• Number of disks
• Number of remote machines used
• Difficult to estimate
• Queues are doomed to happen
• hbase.regionserver.handler.count
• So, for low latency
• Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
• RPC priorities: work in progress (HBASE-11048)
From loaded to overloaded
• MemStore takes too much room: flush, then blocks quite quickly
• hbase.regionserver.global.memstore.size.lower.limit
• hbase.regionserver.global.memstore.size
• hbase.hregion.memstore.block.multiplier
• Too many HFiles: block until compactions keep up
• hbase.hstore.blockingStoreFiles
• Too many WAL files: flush and block
• hbase.regionserver.maxlogs
Machine failure
• Failure
• Detect
• Reallocate
• Replay WAL
• Replaying WAL is NOT required for puts
• hbase.master.distributed.log.replay
• (default true in 1.0)
• Failure = Detect + Reallocate + Retry
• That's in the range of ~1s for simple failures
• Silent failures put you in the 10s range if the hardware does not help
• zookeeper.session.timeout
Single puts
• Millisecond range
• Spikes do happen in steady mode
• 100ms
• Causes: GC, load, splits
Streaming puts
HTable#setAutoFlushTo(false)
HTable#put
HTable#flushCommits
• Same as simple puts, but
• Puts are grouped and sent in the background
• Load is taken into account
• Does not block
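A minimal streaming-put sketch using the calls listed above (not on the slides); "t1", "f" and the row/value patterns are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StreamingPuts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");
    table.setAutoFlushTo(false);             // buffer puts on the client

    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
      table.put(put);                        // returns quickly; groups are sent in the background
    }

    table.flushCommits();                    // push whatever is still buffered
    table.close();                           // close() flushes too
  }
}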
Multiple puts
hbase.client.max.total.tasks (default 100)
hbase.client.max.perserver.tasks (default 5)
hbase.client.max.perregion.tasks (default 1)
• Decouples the client from a latency spike on a single region server
• Increases throughput by 50% compared to the old multiput
• Makes splits and GC more transparent
Conclusion on write path
• Single puts can be very fast
• It’s not a « hard real time » system: there are spikes
• Most latency spikes can be hidden when streaming puts
• Failures are NOT that difficult for the write path
• No WAL to replay
And now for the read path
Read path
• Get/short scan are assumed for low-latency operations
• Again, two APIs
• Single get: HTable#get(Get)
• Multi-get: HTable#get(List<Get>)
• Four stages, same as the write path
• Start (TCP connection, …)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system: you may need to add machines or tune your workload
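A minimal sketch of the two read APIs named above (not on the slides); "t1" and the row keys are placeholders.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");

    // Single get: one RPC, blocks until the RegionServer answers.
    Result single = table.get(new Get(Bytes.toBytes("row-42")));

    // Multi-get: the client groups the Gets by RegionServer and sends the groups
    // in parallel, but the call only returns once the slowest group is back.
    List<Get> gets = new ArrayList<Get>();
    for (int i = 0; i < 100; i++) {
      gets.add(new Get(Bytes.toBytes("row-" + i)));
    }
    Result[] results = table.get(gets);

    System.out.println(single.isEmpty() + " / " + results.length + " results");
    table.close();
  }
}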
Multi get / Client
Group Gets by RegionServer
Execute them one by one
Multi get / Server
Access latency magnitudes
Storage hierarchy: a different view
Dean/2009
Memory is 100,000x faster than disk!
Disk seek = 10ms
Known unknowns
• For each candidate HFile
• Exclude by file metadata
• Timestamp
• Rowkey range
• Exclude by bloom filter
StoreFileScanner#shouldUseScanner()
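To make the bloom-filter exclusion concrete, a sketch of enabling a ROW bloom on a column family at table-creation time (not on the slides; "t1" and "f" are placeholders, and ROW blooms are the 0.96+ default anyway).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class BloomFilterSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // A ROW bloom answers "might this HFile contain this row?" and lets the read
    // path drop candidate files before any disk seek. ROWCOL is stricter but larger.
    HColumnDescriptor family = new HColumnDescriptor("f");
    family.setBloomFilterType(BloomType.ROW);

    HTableDescriptor table = new HTableDescriptor(TableName.valueOf("t1"));
    table.addFamily(family);
    admin.createTable(table);
    admin.close();
  }
}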
Unknown knowns
• Merge sort results polled from Stores
• Seek each scanner to a reference KeyValue
• Retrieve candidate data from disk
• Multiple HFiles => multiple seeks
• hbase.storescanner.parallel.seek.enable=true
• Short Circuit Reads
• dfs.client.read.shortcircuit=true
• Block locality
• Happy clusters compact!
HFileBlock#readBlockData()
BlockCache
• Reuse previously read data
• Maximize cache hit rate
• Larger cache
• Temporal access locality
• Physical access locality
BlockCache#getBlock()
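A sketch of per-family cache hints that follow from the points above (not on the slides; the family names "hot" and "cold" and the 16 KB block size are made-up examples).

import org.apache.hadoop.hbase.HColumnDescriptor;

public class CacheHints {
  // Point-get heavy family: cache data blocks, use smaller blocks so each
  // cached block carries less unrelated data.
  static HColumnDescriptor hotFamily() {
    HColumnDescriptor hot = new HColumnDescriptor("hot");
    hot.setBlockCacheEnabled(true);   // the default: data blocks are cached on read
    hot.setBlocksize(16 * 1024);
    return hot;
  }

  // Scan-only family: keep it out of the BlockCache so it does not evict hot data.
  static HColumnDescriptor coldFamily() {
    HColumnDescriptor cold = new HColumnDescriptor("cold");
    cold.setBlockCacheEnabled(false);
    return cold;
  }

  public static void main(String[] args) {
    System.out.println(hotFamily() + "\n" + coldFamily());
  }
}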
BlockCache Showdown
• LruBlockCache
• Default, onheap
• Quite good most of the time
• Evictions impact GC
• BucketCache
• Off-heap alternative
• Serialization overhead
• Large memory configurations
http://www.n10k.com/blog/blockcache-showdown/
L2 off-heap BucketCache makes a strong showing
Latency enemies: Garbage Collection
• Use heap. Not too much. With CMS.
• Max heap
• 30GB (compressed pointers)
• 8-16GB if you care about 9's
• Healthy cluster load
• Regular, reliable collections
• 25-100ms pause on a regular interval
• An overloaded RegionServer suffers GC overmuch
Off-heap to the rescue?
• BucketCache (0.96, HBASE-7404)
• Network interfaces (HBASE-9535)
• MemStore et al (HBASE-10191)
Latency enemies: Compactions
• Fewer HFiles => fewer seeks
• Evict data blocks!
• Evict Index blocks!!
• hfile.block.index.cacheonwrite
• Evict bloom blocks!!!
• hfile.block.bloom.cacheonwrite
• OS buffer cache to the rescue
• Compacted data is still fresh
• Better than going all the way back to disk
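The two cache-on-write keys above are server-side settings (hbase-site.xml on the RegionServers); a sketch via the Configuration API, purely to show the key names and types.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CacheOnWriteSettings {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Keep index and bloom blocks of freshly compacted files in the BlockCache,
    // so the first reads after a compaction do not have to seek for them.
    conf.setBoolean("hfile.block.index.cacheonwrite", true);
    conf.setBoolean("hfile.block.bloom.cacheonwrite", true);
    System.out.println(conf.get("hfile.block.index.cacheonwrite"));
  }
}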
Failure
• Detect + Reassign + Replay
• Strong consistency requires replay
• Locality drops to 0
• Cache starts from scratch
Hedging our bets
• HDFS Hedged reads (2.4, HDFS-5776)
• Reads on secondary DataNodes
• Strongly consistent
• Works at the HDFS level
• Timeline consistency (HBASE-10070)
• Reads on « Replica Region »
• Not strongly consistent
Read latency in summary
• Steady mode
• Cache hit: < 1 ms
• Cache miss: + 10 ms per seek
• Writing while reading => cache churn
• GC: 25-100ms pause on a regular interval
Network request + (1 - P(cache hit)) * (10 ms * seeks)
• Same long-tail issues as the write path
• Overloaded: same scheduling issues as the write path
• Partial failures hurt a lot
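Working the formula above with assumed numbers (not from the deck): 0.5 ms of network, a 95% cache hit rate, 3 candidate HFiles and 10 ms per seek give about 2 ms expected, while an outright miss still costs 30+ ms.

public class ExpectedReadLatency {
  // network + (1 - P(cache hit)) * (seekMs * seeks), as on the slide.
  static double expectedMs(double networkMs, double cacheHitRate, int seeks, double seekMs) {
    return networkMs + (1.0 - cacheHitRate) * (seekMs * seeks);
  }

  public static void main(String[] args) {
    // 0.5 + (1 - 0.95) * (10 * 3) = 0.5 + 1.5 = 2.0 ms expected
    System.out.println(expectedMs(0.5, 0.95, 3, 10.0) + " ms");
  }
}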
HBase ranges for 99% latency
         Put                    Streamed Multiput      Get                    Timeline get
Steady   milliseconds           milliseconds           milliseconds           milliseconds
Failure  seconds                seconds                seconds                milliseconds
GC       10's of milliseconds   milliseconds           10's of milliseconds   milliseconds
What’s next
• Less GC
• Use less objects
• Offheap
• Compressed BlockCache (HBASE-8894)
• Preferred location (HBASE-4755)
• The « magical 1% »
• Most tools stop at the 99% latency
• What happens after is much more complex
Thanks!
Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
Hadoop Summit June 4, 2014
Speaker notes
  • Random read/write database, latency is very important
  • Micro-seconds? Seconds? We are talking milliseconds
    Everyone stops looking after the 99% -- the literature calls this “magical 1%”
  • How to measure latency in HBase
  • Client connects to region server with TCP connection
    Connection is shared by client threads
    Server manages lots of client connections
    Schedules client queries – synchronization, locks, queues, &c
  • WAL queue shared between regions
    - sometimes the sync work has already been done for you, can help
    - sometimes your small edit is sync’d with another larger one, can hurt you
  • Small cluster, 4-yo machines
    1 put, 1 put, 1put…
    99% is double the mean
    Servers doing other work, but nothing major
  • Where do we spend time? How about the network?
    0.5ms is conservative, usually much less
    TCP round trip, same cluster as previous slide
    99% 4x mean
  • Where do we spend time? How about HDFS?
    Flushing, writing, &c
    Just HDFS, 2.4, 1kb, flush, flush, flush
    99% 5x mean – not bad
  • What else?
    Millisecond world means minor things start to matter:
    - JVM 1.7 is better at blocking queues than 1.6
    - forget to configure Nagle (TCP_NODELAY)? 40ms
    - older linux scheduler bugs, 50ms
    - facebook literature talks a lot about filesystems
  • Regular cluster operations also hurt
    Region split – 1’s of seconds
    Region move – 100ms (better on 0.96)
    GC – 10’s ms even after configuration tuning
  • Now add some load
    Load == concurrency == queues: 1ms
    0.98 adds pluggable scheduler, so you can influence these queues yourself
    Work in progress to expose this through configuration
  • Contract: save you from cluster explosion
    How to protect? Stop writing
    - too many WALs? Stop writes.
    - too many hfiles to compact? Stop writes.
    Lasts indefinitely. New default settings.
  • Puts do not require reads, so WAL replay doesn’t block writes
    Simple crash: quickly detected, 1s
    Hung/frozen:
    - conservative detection takes longer
    - configured to 5-15s, but look like long GC pause
  • New since 0.96
  • Settings for average cluster, unaggressive client
    Decoupled client from issue of a single RS
    YCSB, single empty region, 50% better throughput (HBASE-6295)
  • This talk: assume get/short scan implies low-latency requirements
    No fancy streaming client like Puts; waiting for slowest RS.
  • Gets grouped by RS, groups sent in parallel, block until all groups return.
    Network call: 10’s micros, in parallel
  • Read path in full detail is quite complex
    See Lars’ HBaseCon 2012 talk for the nitty-gritty
    Complexity optimizing around one invariant: 10ms seek
  • Complexity optimizing around one invariant: 10ms seek
    Aiming for 100 microSec world; how to work around this?
    Goal: avoid disk at all cost!
  • Goal: don’t go to disk unless absolutely necessary.
    Tactic: Candidate HFile elimination.
    Regular compactions => 3-5 candidates
    Bloom filters on by default in 0.96+
  • Mergesort over multiple files, multiple seeks
    More spindles = parallel scanning
    SCR avoids proxy process (DataNode)
    But remember! Goal: don’t go to disk unless absolutely necessary.
  • “block” is a segment of an HFile
    Data blocks, index blocks, and bloom blocks
    Read blocks retained in BlockCache
    Seek to same and adjacent data become cache hits
  • HBase ships with a variety of cache implementations
    Happy with 95% stats? Stick with LruBlockCache and modest heapsize
    Pushing 99%? Lru still okay, but watch that heapsize.
    Spent money on RAM? BucketCache
  • GC is a part of healthy operation
    BlockCache garbage is awkward size and age, which means pauses
    Pause time is a function of heap size
    More like ~16GiB if you’re really worried about 99%
    Overloaded: more cache evictions, time in GC
  • Why generate garbage at all?
    GC are smart, but maybe we know our pain spots better?
    Don’t know until we try
  • Necessary for fewer scan candidates, fewer seeks
    Buffer cache to the rescue
    “That’s steady state and overloaded; let’s talk about failure”
  • Replaying WALs takes time
    Unlucky: no data locality, talking to a remote DataNode
    Empty cache
    “Failure isn’t binary. What about the sick and the dying?”
  • Don’t wait for a slow machine!
  • Reads dominated by disk seek, so keep that data in memory
    After cache miss, GC is the next candidate cause of latency
    “Ideal formula”
    P(cache hit): fn (cache size :: db size, request locality)
    Sometimes the jitter dominates
  • Standard deployment, well designed schema
    Millisecond responses, seconds for failure recovery, and GC at a regular interval
    Everything we’ve focused on here impacts the 99th percentile
    Beyond that there’s a lot of interesting problems to solve
  • There’s always more work to be done
    Generate less garbage
    Compressed BlockCache
    Improve recovery time and locality
  • Questions!