HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers

Presentation slides from Cloudera's Todd Lipcon for the HBase HUG: "Avoiding Full GCs with MemStore-Local Allocation Buffers."

Cloudera's Todd Lipcon's presentation slides for the HBase HUG, "Avoiding Full GCs with MemStore-Local Allocation Buffers."

Transcript

  • 1. Avoiding Full GCs with MemStore-Local Allocation Buffers Todd Lipcon todd@cloudera.com Twitter: @tlipcon #hbase IRC: tlipcon February 22, 2011
  • 2. Outline Background HBase and GC A solution Summary
  • 3. Intro / who am I? Been working on data stuff for a few years HBase, HDFS, MR committer Cloudera engineer since March ’09
  • 4. Motivation HBase users want to use large heaps Bigger block caches make for better hit rates Bigger memstores make for larger and more efficient flushes Machines come with 24G-48G RAM But bigger heaps mean longer GC pauses Around 10 seconds/GB on my boxes. Several minute GC pauses wreak havoc
  • 5. GC Disasters 1. Client requests stalled 1 minute “latency” is just as bad as unavailability 2. ZooKeeper sessions stop pinging The dreaded “Juliet Pause” scenario 3. Triggers all kinds of other nasty bugs
  • 6. Yo Concurrent Mark-and-Sweep (CMS)! What part of Concurrent didn’t you understand?
  • 7. Java GC Background Java’s GC is generational Generational hypothesis: most objects either die young or stick around for quite a long time Split the heap into two “generations” - young (aka new) and old (aka tenured) Use different algorithms for the two generations We usually recommend -XX:+UseParNewGC -XX:+UseConcMarkSweepGC Young generation: Parallel New collector Old generation: Concurrent-mark-sweep
  • 8. The Parallel New collector in 60 seconds Divide the young generation into eden, survivor-0, and survivor-1 One survivor space is from-space and the other is to-space Allocate all objects in eden When eden fills up, stop the world and copy live objects from eden and from-space into to-space, swap from and to Once an object has been copied back and forth N times, copy it to the old generation N is the “Tenuring Threshold” (tunable)
  • 9. The CMS collector in 60 seconds. A bit simplified, sorry... Several phases: 1. initial-mark (stop-the-world) - marks roots (eg thread stacks) 2. concurrent-mark - traverse references starting at roots, marking what’s live 3. concurrent-preclean - another pass of the same (catch new objects) 4. remark (stop-the-world) - any last changed/new objects 5. concurrent-sweep - clean up dead objects to update free space tracking Note: dead objects free up space, but it’s not contiguous. We’ll come back to this later!
  • 10. CMS failure modes 1. When young generation collection happens, it needs space in the old gen. What if CMS is already in the middle of concurrent work, but there’s no space? The dreaded concurrent mode failure! Stop the world and collect. Solution: lower value of -XX:CMSInitiatingOccupancyFraction so CMS starts working earlier 2. What if there’s space in the old generation, but not enough contiguous space to promote a large object? We need to compact the old generation (move all free space to be contiguous) This is also stop-the-world! Kaboom!
  • 11. OK... so life sucks. What can we do about it?
  • 12. Step 1. Hypothesize Setting the initiating occupancy fraction low puts off GC, but it eventually happens no matter what We see promotion failed followed by long GC pause, even when 30% of the heap is free. Why? Must be fragmentation!
  • 13. Step 2. Measure Let’s make some graphs: -XX:PrintFLSStatistics=1 -XX:PrintCMSStatistics=1 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -Xloggc:/.../logs/gc-$(hostname).log FLS Statistics: verbose information about the state of the free space inside the old generation Free space - total amount of free space Num blocks - number of fragments it’s spread into Max chunk size parse-fls-statistics.py → R and ggplot2
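
    The slides pipe parse-fls-statistics.py into R and ggplot2 to make the graphs. As a rough stand-in only, here is a minimal Java sketch that pulls the same three FLS metrics out of such a GC log and prints CSV. The class name FlsStats is made up, and the exact log-line labels ("Total Free Space:", "Max Chunk Size:", "Number of Blocks:") are assumptions based on typical -XX:PrintFLSStatistics=1 output, which varies by JVM version and reports sizes in heap words rather than bytes.

      import java.io.BufferedReader;
      import java.io.FileReader;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      // Scan a GC log written with -XX:PrintFLSStatistics=1 and emit one CSV row
      // per statistics dump: free space, number of blocks, max chunk size.
      public class FlsStats {
        private static final Pattern FREE   = Pattern.compile("Total Free Space:\\s*(\\d+)");
        private static final Pattern MAX    = Pattern.compile("Max\\s+Chunk Size:\\s*(\\d+)");
        private static final Pattern BLOCKS = Pattern.compile("Number of Blocks:\\s*(\\d+)");

        public static void main(String[] args) throws Exception {
          long free = -1, max = -1, blocks = -1;
          System.out.println("free_space,num_blocks,max_chunk");
          try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
              Matcher m;
              if ((m = FREE.matcher(line)).find()) {
                // "Total Free Space" starts a new dump: flush the previous one if complete.
                if (free >= 0 && max >= 0 && blocks >= 0) {
                  System.out.println(free + "," + blocks + "," + max);
                }
                free = Long.parseLong(m.group(1));
                max = blocks = -1;
              } else if ((m = MAX.matcher(line)).find()) {
                max = Long.parseLong(m.group(1));
              } else if ((m = BLOCKS.matcher(line)).find()) {
                blocks = Long.parseLong(m.group(1));
              }
            }
          }
          // Flush the final dump, if any.
          if (free >= 0 && max >= 0 && blocks >= 0) {
            System.out.println(free + "," + blocks + "," + max);
          }
        }
      }

    Run it as java FlsStats gc-myhost.log > fls.csv and plot free_space and max_chunk over time; lots of free space but a shrinking max chunk size is the fragmentation signature the following slides point at.
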
  • 14. 3 YCSB workloads, graphed
  • 15. Workload 1: Insert-only
  • 16. Workload 2: Read-only with cache churn
  • 17. Workload 3: Read-only with no cache churn. So boring I didn’t make a graph! All allocations are short lived → stay in young gen
  • 18. Recap: What have we learned? Fragmentation is what causes long GC pauses. Write load seems to cause fragmentation. Read load (LRU cache churn) isn’t nearly so bad (at least for my test workloads)
  • 19. Taking a step back: Why does write load cause fragmentation? Imagine we have 5 regions, A through E. We take writes in the following order into an empty old generation:
  • 20. Taking a step back: Why does write load cause fragmentation? Imagine we have 5 regions, A through E. We take writes in the following order into an empty old generation: A
  • 21. Taking a step back: Why does write load cause fragmentation? Imagine we have 5 regions, A through E. We take writes in the following order into an empty old generation: AB
  • 22. Taking a step back: Why does write load cause fragmentation? Imagine we have 5 regions, A through E. We take writes in the following order into an empty old generation: ABC
  • 23. Taking a step back: Why does write load cause fragmentation? Imagine we have 5 regions, A through E. We take writes in the following order into an empty old generation: ABCD
  • 24. Taking a step back: Why does write load cause fragmentation? Imagine we have 5 regions, A through E. We take writes in the following order into an empty old generation: ABCDE
  • 25. Taking a step back: Why does write load cause fragmentation? Imagine we have 5 regions, A through E. We take writes in the following order into an empty old generation: ABCDEABCEDDAECBACEBCED
  • 26. Taking a step back: Why does write load cause fragmentation? Imagine we have 5 regions, A through E. We take writes in the following order into an empty old generation: ABCDEABCEDDAECBACEBCED Now B’s memstore fills up and flushes. We’re left with:
  • 27. Taking a step back: Why does write load cause fragmentation? Imagine we have 5 regions, A through E. We take writes in the following order into an empty old generation: ABCDEABCEDDAECBACEBCED Now B’s memstore fills up and flushes. We’re left with: A CDEA CEDDAEC ACE CED
  • 28. Taking a step back: Why does write load cause fragmentation? Imagine we have 5 regions, A through E. We take writes in the following order into an empty old generation: ABCDEABCEDDAECBACEBCED Now B’s memstore fills up and flushes. We’re left with: A CDEA CEDDAEC ACE CED Looks like fragmentation!
  • 29. Also known as swiss cheese. If every write is exactly the same size, it’s fine - we’ll fill in those holes. But this is seldom true.
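
    To make the swiss-cheese effect concrete, here is a toy Java sketch (not HBase code) of the scenario slides 19-29 walk through: bump-allocate one cell per write in the interleaved order above, then free region B's cells and look at what is left behind. The class name and the randomized cell sizes are invented purely for illustration.

      import java.util.ArrayList;
      import java.util.List;
      import java.util.Random;

      // Toy model of slides 19-29: bump-allocate one cell per write into a flat
      // "old generation", then free region B's cells and see what is left behind.
      public class FragmentationDemo {
        public static void main(String[] args) {
          String writes = "ABCDEABCEDDAECBACEBCED";   // write order from the slides
          Random rng = new Random(42);                // invented, varying cell sizes
          List<int[]> holesFromB = new ArrayList<>(); // {offset, length} pairs freed with B
          int offset = 0;
          for (char region : writes.toCharArray()) {
            int size = 64 + rng.nextInt(192);         // writes are not all the same size
            if (region == 'B') {
              holesFromB.add(new int[] { offset, size });
            }
            offset += size;                           // next allocation goes right after
          }
          int freed = 0, largest = 0;
          for (int[] hole : holesFromB) {
            freed += hole[1];
            largest = Math.max(largest, hole[1]);
            System.out.println("hole at offset " + hole[0] + ", length " + hole[1]);
          }
          System.out.println("old gen used: " + offset + " bytes; flushing B freed " + freed
              + " bytes, but the largest contiguous chunk is only " + largest + " bytes");
        }
      }
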
  • 30. A solution. The crucial issue is that memory allocations for a given memstore aren’t next to each other in the old generation. When we free an entire memstore we only get tiny blocks of free space. What if we ensure that the memory for a memstore is made of large blocks? Enter the MemStore-Local Allocation Buffer (MSLAB)
  • 31. What’s an MSLAB? Each MemStore has an instance of MemStoreLAB. MemStoreLAB has a 2MB curChunk with nextFreeOffset starting at 0. Before inserting a KeyValue that points to some byte[], copy the data into curChunk and increment nextFreeOffset by data.length. Insert a KeyValue pointing inside curChunk instead of the original data. If a chunk fills up, just make a new one. This is all lock-free, using atomic compare-and-swap instructions.
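
    Below is a stripped-down sketch of the allocation path this slide describes (a 2MB chunk plus a compare-and-swap bump pointer), written from the slide text rather than from the real HBase 0.90 MemStoreLAB class; names such as Chunk, alloc() and copyInto() are mine, and error handling is omitted.

      import java.util.concurrent.atomic.AtomicInteger;
      import java.util.concurrent.atomic.AtomicReference;

      // Sketch of the MSLAB idea: copy every inserted value into a per-memstore
      // 2MB chunk via a lock-free bump pointer, so the old generation only ever
      // holds whole chunks rather than individually allocated KeyValue buffers.
      public class MemStoreLabSketch {
        static final int CHUNK_SIZE = 2 * 1024 * 1024;

        static class Chunk {
          final byte[] data = new byte[CHUNK_SIZE];
          final AtomicInteger nextFreeOffset = new AtomicInteger(0);

          // Try to reserve len bytes; return the start offset, or -1 if the chunk is full.
          int alloc(int len) {
            while (true) {
              int old = nextFreeOffset.get();
              if (old + len > CHUNK_SIZE) {
                return -1;                            // does not fit: caller rolls a new chunk
              }
              if (nextFreeOffset.compareAndSet(old, old + len)) {
                return old;                           // won the CAS race: this range is ours
              }
              // lost the race to another writer thread: retry
            }
          }
        }

        private final AtomicReference<Chunk> curChunk = new AtomicReference<>(new Chunk());

        // Copy data into the current chunk and return a pointer for the KeyValue to use.
        public Allocation copyInto(byte[] data) {
          if (data.length > CHUNK_SIZE) {
            // Too big to ever fit in a chunk: give it its own buffer (the real feature
            // has a configurable max allocation size for the same reason).
            return new Allocation(data.clone(), 0, data.length);
          }
          while (true) {
            Chunk c = curChunk.get();
            int off = c.alloc(data.length);
            if (off >= 0) {
              System.arraycopy(data, 0, c.data, off, data.length);
              return new Allocation(c.data, off, data.length);
            }
            // Chunk filled up: install a fresh one (only one thread wins; others retry).
            curChunk.compareAndSet(c, new Chunk());
          }
        }

        public static class Allocation {
          public final byte[] buf;
          public final int offset, length;
          Allocation(byte[] buf, int offset, int length) {
            this.buf = buf; this.offset = offset; this.length = length;
          }
        }
      }
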
  • 32. How does this help? The original data to be inserted becomes very short-lived, and dies in the young generation. The only data in the old generation is made of 2MB chunks Each chunk only belongs to one memstore. When we flush, we always free up 2MB chunks, and avoid the swiss cheese effect. Next time we allocate, we need exactly 2MB chunks again, and there will definitely be space.
  • 33. Does it work?
  • 34. It works! Have seen basically zero full GCs with MSLAB enabled, after days of load testing
  • 35. Summary Most GC pauses are caused by fragmentation in the old generation. The CMS collector doesn’t compact, so the only way it can fight fragmentation is to pause. The MSLAB moves all MemStore allocations into contiguous 2MB chunks in the old generation. No more GC pauses!
  • 36. How to try it 1. Upgrade to HBase 0.90.1 (included in CDH3b4) 2. Set hbase.hregion.memstore.mslab.enabled to true Also tunable: hbase.hregion.memstore.mslab.chunksize (in bytes, default 2M) hbase.hregion.memstore.mslab.max.allocation (in bytes, default 256K) 3. Report back your results!
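
    In a real deployment these properties belong in hbase-site.xml on every region server (followed by a restart); the snippet below is only an illustration that uses the standard Hadoop/HBase Configuration API to spell out the property names and the defaults quoted on the slide, and running it client-side does not by itself reconfigure a running server.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;

      // The slide's settings expressed through the Configuration API, mostly to
      // make the property names and defaults concrete. On a real cluster they are
      // set in hbase-site.xml on each region server, followed by a restart.
      public class EnableMslab {
        public static void main(String[] args) {
          Configuration conf = HBaseConfiguration.create();
          conf.setBoolean("hbase.hregion.memstore.mslab.enabled", true);
          conf.setInt("hbase.hregion.memstore.mslab.chunksize", 2 * 1024 * 1024); // default 2MB
          conf.setInt("hbase.hregion.memstore.mslab.max.allocation", 256 * 1024); // default 256KB
          System.out.println("mslab enabled: "
              + conf.getBoolean("hbase.hregion.memstore.mslab.enabled", false));
        }
      }
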
  • 37. Future work Flat 2MB chunk per region → 2GB RAM minimum usage for 1000 regions incrementColumnValue currently bypasses MSLAB for subtle reasons We’re doing an extra memory copy into MSLAB chunk - we can optimize this out Maybe we can relax CMSInitiatingOccupancyFraction back up a bit?
  • 38. So I don’t forget... Corporate shill time. Cloudera is offering HBase training on March 10th. 15 percent off with the hbase meetup code.
  • 39. todd@cloudera.com Twitter: @tlipcon #hbase IRC: tlipcon P.S. we’re hiring!