3. Intro / who am I?
Been working on data stuff for a few years
Committer on HBase, HDFS, and MapReduce
Cloudera engineer since March ’09
4. Motivation
HBase users want to use large heaps
Bigger block caches make for better hit rates
Bigger memstores make for larger, more efficient flushes
Machines come with 24G-48G RAM
But bigger heaps mean longer GC pauses
Around 10 seconds/GB on my boxes.
GC pauses of several minutes wreak havoc
5. GC Disasters
1. Client requests stalled
1 minute “latency” is just as bad as unavailability
2. ZooKeeper sessions stop pinging
The dreaded “Juliet Pause” scenario
3. Triggers all kinds of other nasty bugs
6. Yo Concurrent Mark-and-Sweep (CMS)!
What part of “Concurrent” didn’t you understand?
7. Java GC Background
Java’s GC is generational
Generational hypothesis: most objects either die young or stick around for quite a long time
Split the heap into two “generations”: young (aka new) and old (aka tenured)
Use different algorithms for the two generations
We usually recommend -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
Young generation: Parallel New collector
Old generation: Concurrent Mark-Sweep (CMS)
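For example (an illustrative sketch only; the heap size is a placeholder, and conf/hbase-env.sh is one common place to put this), the region server’s JVM options might include:
export HBASE_OPTS="-Xmx8g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"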
8. The Parallel New collector in 60 seconds
Divide the young generation into eden, survivor-0, and survivor-1
One survivor space is from-space and the other is to-space
Allocate all objects in eden
When eden fills up, stop the world and copy live objects from eden and from-space into to-space, then swap from and to
Once an object has been copied back and forth N times, copy it to the old generation
N is the “Tenuring Threshold” (tunable)
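As an illustration of the generational hypothesis (a hypothetical demo, not from the talk): the temporary buffers below normally die in eden under cheap ParNew collections, while the few we retain survive enough young collections to be tenured into the old generation. Run it with the flags above plus -verbose:gc to watch the collections.
import java.util.ArrayList;
import java.util.List;

public class GenerationalDemo {
    public static void main(String[] args) {
        // Long-lived: survives young collections, eventually tenured.
        List<byte[]> retained = new ArrayList<byte[]>();
        for (int i = 0; i < 1000000; i++) {
            // Short-lived: allocated in eden, usually dead by the next ParNew.
            byte[] temp = new byte[1024];
            if (i % 1000 == 0) {
                retained.add(temp); // keep roughly 1 in 1000 alive
            }
        }
        System.out.println("Retained " + retained.size() + " buffers");
    }
}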
9. The CMS collector in 60 seconds
A bit simplified, sorry...
Several phases:
1. initial-mark (stop-the-world): marks the roots (e.g. thread stacks)
2. concurrent-mark: traverse references starting at the roots, marking what’s live
3. concurrent-preclean: another pass of the same (catch new objects)
4. remark (stop-the-world): catch any objects changed or created since the mark
5. concurrent-sweep: reclaim dead objects and update free space tracking
Note: dead objects free up space, but that space isn’t contiguous. We’ll come back to this later!
10. CMS failure modes
1. When a young generation collection happens, surviving objects need space in the old generation to be promoted into. What if CMS is still in the middle of its concurrent work, but there’s no space?
The dreaded concurrent mode failure! Stop the world and collect.
Solution: lower the value of -XX:CMSInitiatingOccupancyFraction so CMS starts working earlier (see the example after this slide)
2. What if there’s space in the old generation, but not enough contiguous space to promote a large object?
We need to compact the old generation (move all the free space so it’s contiguous)
This is also stop-the-world! Kaboom!
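For failure mode 1, the threshold can be pinned explicitly; an illustrative sketch (the value 70 is just an example, not a recommendation):
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
Without -XX:+UseCMSInitiatingOccupancyOnly, the JVM treats the fraction only as a starting hint and adapts it based on runtime statistics.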
12. Step 1. Hypothesize
Setting the initiating occupancy fraction low puts off the full GCs, but they eventually happen no matter what
We see promotion failed followed by a long GC pause, even when 30% of the heap is free.
Why? Must be fragmentation!
13. Step 2. Measure
Let’s make some graphs:
-XX:PrintFLSStatistics=1
-XX:PrintCMSStatistics=1
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps -verbose:gc
-Xloggc:/.../logs/gc-$(hostname).log
FLS Statistics: verbose information about the state of the free space inside the old generation
Free space: total amount of free space
Num blocks: number of fragments the free space is spread across
Max chunk size: size of the largest contiguous free chunk
parse-fls-statistics.py → plotted with R and ggplot2
17. Workload 3
Read-only with no cache churn
So boring I didn’t make a graph!
All allocations are short-lived → they stay in the young gen
18. Recap
What have we learned?
Fragmentation is what causes long GC pauses
Write load seems to cause fragmentation
Read load (LRU cache churn) isn’t nearly so bad¹
¹ At least for my test workloads
28. Taking a step back
Why does write load cause fragmentation?
Imagine we have 5 regions, A through E
We take writes in the following order into an empty old generation:
ABCDEABCEDDAECBACEBCED
Now B’s memstore fills up and flushes. We’re left with:
A CDEA CEDDAEC ACE CED
Looks like fragmentation!
29. Also known as swiss cheese
If every write is exactly the same size, it’s fine: we’ll fill in those holes. But this is seldom true.
30. A solution
The crucial issue is that the memory allocations for a given memstore aren’t next to each other in the old generation.
When we free an entire memstore we only get back lots of tiny blocks of free space.
What if we ensure that the memory for a memstore is made up of large blocks?
Enter the MemStore-Local Allocation Buffer (MSLAB)
31. What’s an MSLAB?
Each MemStore has an instance of MemStoreLAB.
MemStoreLAB has a 2MB curChunk with nextFreeOffset starting at 0.
Before inserting a KeyValue that points to some byte[], copy the data into curChunk and increment nextFreeOffset by data.length.
Insert a KeyValue pointing inside curChunk instead of at the original data.
If a chunk fills up, just make a new one.
This is all lock-free, using atomic compare-and-swap instructions (sketched below).
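A minimal sketch of this allocation scheme, not the actual HBase MemStoreLAB code (the class and method names below are made up, and chunk rollover and oversized allocations are simplified):
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// MSLAB-style allocator sketch: hand out slices of a fixed 2MB chunk using a
// lock-free compare-and-swap on the chunk's next-free offset.
public class MemStoreLABSketch {
    static final int CHUNK_SIZE = 2 * 1024 * 1024; // 2MB chunks

    static final class Chunk {
        final byte[] data = new byte[CHUNK_SIZE];
        final AtomicInteger nextFreeOffset = new AtomicInteger(0);
    }

    static final class Allocation {
        final byte[] chunk;
        final int offset;
        final int length;
        Allocation(byte[] chunk, int offset, int length) {
            this.chunk = chunk;
            this.offset = offset;
            this.length = length;
        }
    }

    private final AtomicReference<Chunk> curChunk =
        new AtomicReference<Chunk>(new Chunk());

    // Copy src into a chunk and return where it landed; the KeyValue then
    // points at (chunk, offset, length) instead of at src itself.
    public Allocation copy(byte[] src) {
        // Oversized allocations bypass the LAB (cf. the max.allocation
        // setting); here we simply point at the original array.
        if (src.length > CHUNK_SIZE) {
            return new Allocation(src, 0, src.length);
        }
        while (true) {
            Chunk c = curChunk.get();
            int oldOffset = c.nextFreeOffset.get();
            if (oldOffset + src.length <= CHUNK_SIZE) {
                // Claim [oldOffset, oldOffset + src.length) with a CAS;
                // if another thread wins the race, just retry.
                if (c.nextFreeOffset.compareAndSet(oldOffset, oldOffset + src.length)) {
                    System.arraycopy(src, 0, c.data, oldOffset, src.length);
                    return new Allocation(c.data, oldOffset, src.length);
                }
            } else {
                // Chunk is full: install a fresh one (only one racer succeeds)
                // and retry. Retired chunks stay alive as long as KeyValues
                // point into them.
                curChunk.compareAndSet(c, new Chunk());
            }
        }
    }
}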
32. How does this help?
The original data to be inserted becomes very short-lived and dies in the young generation.
The only data in the old generation is made up of 2MB chunks.
Each chunk belongs to exactly one memstore.
When we flush, we always free up whole 2MB chunks, and avoid the swiss cheese effect.
Next time we allocate, we need exactly 2MB chunks again, so there will definitely be space.
34. It works!
Have seen basically zero full GCs with MSLAB enabled, after days of load testing
35. Summary
Most long GC pauses are caused by fragmentation in the old generation.
The CMS collector doesn’t compact, so the only way it can fight fragmentation is to fall back to a stop-the-world full GC.
The MSLAB moves all MemStore allocations into contiguous 2MB chunks in the old generation.
No more GC pauses!
36. How to try it
1. Upgrade to HBase 0.90.1 (included in CDH3b4)
2. Set hbase.hregion.memstore.mslab.enabled to true (see the snippet after this list)
Also tunable:
hbase.hregion.memstore.mslab.chunksize (in bytes, default 2MB)
hbase.hregion.memstore.mslab.max.allocation (in bytes, default 256KB)
3. Report back your results!
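For reference, a sketch of the corresponding hbase-site.xml entry on the region servers (only the enabled flag is required; the tunables keep their defaults if left out):
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>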
37. Future work
A flat 2MB chunk per region → 2GB minimum RAM usage for 1000 regions
incrementColumnValue currently bypasses the MSLAB for subtle reasons
We’re doing an extra memory copy into the MSLAB chunk; we can optimize this out
Maybe we can relax CMSInitiatingOccupancyFraction back up a bit?
38. So I don’t forget...
Corporate shill time
Cloudera is offering HBase training on March 10th.
15% off with the hbase meetup code.