In this session we cover the variety of garbage collection algorithms, with a strong focus on tracing garbage collectors. We discuss concurrent and parallel GC, and novel approaches such as Azul's Continuously Concurrent Compacting Collector (C4) and IBM's Metronome real-time GC.
10. Taxonomy
RC
M/S/C
Concurrent
Parallel
Generations
Miscellanea
Problems with Reference Counting
Performance overhead and contention associated with modifying the reference count
Particularly bad for multi-core cache lines
Particularly bad for short-lived local variables
Cyclic structures cannot be released
Python provides an additional GC for reclaiming cycles
Objective-C provides a __weak keyword for declaring pointers without retain
Memory reclamation is proportional to the number of objects allocated by the program
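The cycle problem can be observed directly in CPython, which pairs reference counting with a separate cycle collector. The following is a minimal sketch that assumes CPython (other Python implementations may not use reference counting at all); `Node` and `make_cycle` are illustrative names, not part of the original slides.

```python
import gc
import weakref

class Node:
    """Illustrative object that can point at another Node."""
    def __init__(self):
        self.other = None

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a      # a -> b -> a: each keeps the other's count above zero
    return weakref.ref(a)        # weak reference: observes a without keeping it alive

gc.disable()                     # switch off the automatic cycle collector
probe = make_cycle()             # both locals are gone, but the cycle remains
alive_after_refcount = probe() is not None
gc.collect()                     # run the cycle collector explicitly
alive_after_cycle_gc = probe() is not None
gc.enable()
```

With reference counting alone the two nodes stay alive; only the explicit cycle collection reclaims them.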
How Are We Doing?
This notion of reachability allows us to reclaim more memory, and to do it sooner, than reference counting!
// C++
void f()
{
    auto a = make_shared<A>(...);
    a->work();
    // 100 lines that
    // don’t use a
}   // a dies here
// C#
void f()
{
    var a = new A(...);
    a.Work();
    // a can be collected here
    // 100 lines that
    // don’t use a
}
Side Effect
In some situations, an object may be collected while a method on it is still executing
class A {
    public void foo() {
        Thread.Sleep(1000); // does not use this
    }
    ~A() {
        // called to reclaim unmanaged resources held by A
        // only after the object is deemed unreachable
    }
}
Finding Unused Objects
“Two-Color” Mark
Begin from the set of roots and traverse the heap graph recursively
Naïve, two-phase memory traversal
TRAVERSE(o)
    if marked(o) = TRUE return
    marked(o) ← TRUE
    for v in o’s fields do TRAVERSE(v) end
end
for r in ROOTS do TRAVERSE(*r) end
walk the heap linearly and reclaim all objects not marked
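The two phases above can be sketched in a few lines. This is an illustrative model only: objects are plain names, `fields` is a hypothetical map from each object to the objects it references, and the recursion is replaced by an explicit stack so that deep graphs do not exhaust the call stack.

```python
def mark(roots, fields):
    """Return the set of objects reachable from roots.

    fields maps each object name to the names it references.
    """
    marked = set()
    stack = list(roots)          # explicit stack instead of recursion
    while stack:
        o = stack.pop()
        if o in marked:
            continue
        marked.add(o)
        stack.extend(fields[o])
    return marked

def sweep(heap, marked):
    """Walk the heap linearly; return (survivors, reclaimed)."""
    survivors = [o for o in heap if o in marked]
    reclaimed = [o for o in heap if o not in marked]
    return survivors, reclaimed

# Hypothetical heap: A -> B -> C, plus an unreachable cycle D <-> E.
fields = {"A": ["B"], "B": ["C"], "C": [], "D": ["E"], "E": ["D"]}
heap = ["A", "B", "C", "D", "E"]
survivors, reclaimed = sweep(heap, mark(["A"], fields))
```

Note that the unreachable cycle D <-> E is reclaimed, which reference counting alone could not do.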
Finding Unused Objects
Tri-Color Mark
White objects are candidates for collection
Black objects were proven to be reachable, and all their fields have been traversed
Grey objects are known to be reachable, but their fields have not been traversed yet
GREY ← { g : ∃r ∈ ROOTS, r → g }
WHITE ← all objects \ GREY
BLACK ← ∅
while GREY ≠ ∅ do
    pick any g ∈ GREY
    BLACK ← BLACK ∪ { g }
    GREY ← (GREY ∪ { w ∈ WHITE : g → w }) \ { g }
    WHITE ← WHITE \ GREY
end
reclaim all WHITE objects
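The set-based formulation translates almost directly into code. The sketch below is one possible reading of it, using Python sets for the three colors; the object graph and the name `tricolor_collect` are made up for illustration.

```python
def tricolor_collect(roots, fields):
    """Return (black, white) after tracing; white objects are garbage."""
    grey = set(roots)
    white = set(fields) - grey
    black = set()
    while grey:
        g = grey.pop()                       # pick any grey object
        black.add(g)
        children = {w for w in fields[g] if w in white}
        white -= children                    # shade white children grey
        grey |= children
    return black, white

# Hypothetical graph: A, B and C are reachable from the root A; D and E are not.
fields = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A"], "E": []}
black, white = tricolor_collect(["A"], fields)
```

The loop ends when the grey set is empty; at that point no black object references a white one, so every white object can be reclaimed.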
Stop The World?
Naïve garbage collectors suspend all mutator threads during a collection
With multi-gigabyte heaps this creates unacceptable multi-second pauses
Modern tracing collectors suspend mutator threads selectively, only when necessary
Importantly, threads executing code outside the runtime (e.g., native code) don’t need to be suspended for as long as they stay outside it
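How To Suspend Threads Safely? (2)
LONG GlobalExceptionHandler(PEXCEPTION_POINTERS exc)
{
    // For an access violation, ExceptionInformation[1] holds the faulting address
    if (exc->ExceptionRecord->ExceptionCode ==
            EXCEPTION_ACCESS_VIOLATION &&
        exc->ExceptionRecord->ExceptionInformation[1] == (ULONG_PTR)pGuardPage)
    {
        SetEvent(hThreadEvents[nThisThread]);   // tell the GC this thread has paused
        WaitForSingleObject(hResumeEvent, ...); // block until the GC resumes us
        return EXCEPTION_CONTINUE_EXECUTION;    // retry the faulting access
    }
    return EXCEPTION_CONTINUE_SEARCH;
}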
How To Suspend Threads Safely? (3)
To suspend all threads safely, mark the page as read-only and wait for all threads to pause when they hit the exception
void SuspendAllThreads()
{
    ResetEvent(hResumeEvent);
    VirtualProtect(pGuardPage, PAGE_READONLY, ...);   // threads now fault on their next write to the page
    WaitForMultipleObjects(..., hThreadEvents, ...);  // wait until every thread has paused
    // ... all mutator threads are paused at this point ...
    VirtualProtect(pGuardPage, PAGE_READWRITE, ...);
    SetEvent(hResumeEvent);                           // let the paused threads continue
}
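The same handshake can be modeled without page protection. The sketch below swaps the guard-page trick for an explicit flag that mutators poll at safe points, which is a different (and simpler) technique than the one on the slide; the `Safepoint` class and its method names are hypothetical.

```python
import threading

class Safepoint:
    def __init__(self, n_threads):
        self.n_threads = n_threads
        self.cond = threading.Condition()
        self.stop_requested = False
        self.parked = 0

    def poll(self):
        """Mutators call this at safe points; it blocks while a stop is pending."""
        if not self.stop_requested:          # cheap fast path, like testing a flag
            return
        with self.cond:
            if not self.stop_requested:
                return
            self.parked += 1
            self.cond.notify_all()           # tell the collector we have paused
            while self.stop_requested:
                self.cond.wait()
            self.parked -= 1

    def stop_the_world(self, work):
        """Pause every mutator, run work(), then resume them."""
        with self.cond:
            self.stop_requested = True
            while self.parked < self.n_threads:
                self.cond.wait()
            result = work()                  # all mutators are parked here
            self.stop_requested = False
            self.cond.notify_all()
        return result

def demo(n_threads=3):
    sp = Safepoint(n_threads)
    done = threading.Event()

    def mutator():
        while not done.is_set():
            sp.poll()                        # safe point inside the loop

    threads = [threading.Thread(target=mutator) for _ in range(n_threads)]
    for t in threads:
        t.start()
    parked_during_gc = sp.stop_the_world(lambda: sp.parked)
    done.set()
    for t in threads:
        t.join()
    return parked_during_gc

parked_during_gc = demo()
```

The collector's work runs only after every mutator has reported in, mirroring the wait on `hThreadEvents` in the slide's code.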
On The Fly Updates?
The tri-color states can be updated on-the-fly, without pausing the mutator threads
Newly allocated objects are marked black*
When a.f = b executes (write barrier):
If b is white and a is black, mark b as grey
Otherwise, do nothing
Occasional short pauses are still necessary to clear the grey set and reach a consistent state
* Strictly speaking, this is optional, because a new object should survive only if it is assigned to some field/root in a black/grey object
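To see why the write barrier matters, consider a mutator that stores a white object into an already-scanned black object and then erases the only other reference to it. The sketch below replays that scenario with and without the barrier; `Obj`, `write_field`, and `scenario` are illustrative names, and the collector is simulated in a single thread.

```python
WHITE, GREY, BLACK = "white", "grey", "black"

class Obj:
    def __init__(self, name):
        self.name = name
        self.color = WHITE
        self.fields = {}

def write_field(a, field, b, barrier=True):
    """Perform a.f = b, shading b grey if a black object would hide a white one."""
    if barrier and b is not None and a.color == BLACK and b.color == WHITE:
        b.color = GREY
    a.fields[field] = b

def finish_marking(objects):
    """Drain the grey set: blacken grey objects and shade their white children."""
    grey = [o for o in objects if o.color == GREY]
    while grey:
        g = grey.pop()
        g.color = BLACK
        for child in g.fields.values():
            if child is not None and child.color == WHITE:
                child.color = GREY
                grey.append(child)

def scenario(barrier):
    a, b, c = Obj("a"), Obj("b"), Obj("c")
    c.fields["f"] = b                    # initially only c references b
    a.color = BLACK                      # the collector already finished scanning a
    c.color = GREY                       # c is still waiting to be scanned
    # The mutator runs while marking is in progress:
    write_field(a, "f", b, barrier)      # a.f = b
    write_field(c, "f", None, barrier)   # c.f = null, hiding b from the collector
    finish_marking([a, b, c])
    return b.color

color_with_barrier = scenario(barrier=True)
color_without_barrier = scenario(barrier=False)
```

Without the barrier, b is still white when marking finishes and would be reclaimed even though a references it.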
Baker’s Implicit Collection
Objects have two linked list fields and a color bit
To allocate, take an object from the free list
When the free list is empty, move all live objects to a separate list and make all other objects free implicitly
Doesn’t handle fragmentation, but is very cheap for large objects (no copying) and doesn’t require reference updates
Compaction
Perfect compaction: requires multiple heap traversals, expensive
Two-pointer compaction: cheaper, imperfect
ptr1 ← beginning of heap
ptr2 ← end of heap
while ptr1 < ptr2 do
    repeat ptr2 ← prev(ptr2) until *ptr2 ∈ BLACK    // last live object, scanning backward
    repeat ptr1 ← next(ptr1) until *ptr1 ∈ WHITE    // first gap, scanning forward
    if ptr1 < ptr2 and size(*ptr2) ≤ size(*ptr1) then copy(ptr1, ptr2) end
end
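The two-pointer idea is easiest to see when all cells have the same size, so that any live object fits into any gap. The sketch below makes that simplifying assumption (the slide's version compares sizes instead) and also assumes the heap has already been swept, with `None` standing for a reclaimed cell; `two_finger_compact` is an illustrative name.

```python
def two_finger_compact(heap):
    """Compact a heap of equal-sized cells in place.

    heap is a list whose entries are either None (a free cell) or a dict
    {"name": ..., "refs": [cell indices]}. Returns the forwarding table
    mapping old index to new index for every object that moved.
    """
    forwarding = {}
    free, live = 0, len(heap) - 1
    while True:
        while free < live and heap[free] is not None:
            free += 1                         # advance to the next gap
        while free < live and heap[live] is None:
            live -= 1                         # retreat to the last live object
        if free >= live:
            break
        heap[free], heap[live] = heap[live], None
        forwarding[live] = free
    # Second pass: redirect references to objects that moved.
    for cell in heap:
        if cell is not None:
            cell["refs"] = [forwarding.get(r, r) for r in cell["refs"]]
    return forwarding

heap = [
    {"name": "A", "refs": [4]},
    None,
    {"name": "B", "refs": []},
    None,
    {"name": "C", "refs": [2]},
]
forwarding = two_finger_compact(heap)
layout = [cell["name"] if cell else None for cell in heap]
```

Objects end up contiguous at the start of the heap, but not in their original order, which is one reason this style of compaction is called imperfect.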
Compaction Procedure
Compaction requires updating references
Can be combined with stop-the-world and marking
stack ← ROOTS
for g in stack do
    place forwarding pointer in g’s old location
    move g to new location
    for v in g’s fields do
        if v was forwarded then update g.v else push v to stack end
    end
end
Two-Finger Copy (Cheney)
copyptr, scanptr ← start of TO
for r in ROOTS do
    if *r was forwarded then r ← (*r).fwdptr else copy(r) end
end
while scanptr < copyptr do
    for v in FIELDS(scanptr) do
        if v is in FROM then
            if v was forwarded then v ← (*v).fwdptr else copy(v) end
        end
    end
    scanptr ← scanptr + size(*scanptr)
end
procedure copy(p)
    memcpy(copyptr, p, size(*p))
    p, (*p).fwdptr ← copyptr
    copyptr ← copyptr + size(*p)
end
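A compact way to experiment with Cheney's algorithm is to model each space as a list and each reference as an index. The sketch below does that; it keeps forwarding information in a side table rather than inside the old object, which is a simplification chosen for this example, and `cheney_copy` is an illustrative name.

```python
def cheney_copy(from_space, roots):
    """Copy everything reachable from roots into a fresh to-space.

    from_space is a list of objects; each object is a list of field values,
    where a field is either None or the from-space index of another object.
    Returns (to_space, new_roots) with all indices rewritten to to-space.
    """
    to_space = []                 # the copy pointer is len(to_space)
    forward = {}                  # from-space index -> to-space index

    def copy(i):
        if i not in forward:
            forward[i] = len(to_space)
            to_space.append(list(from_space[i]))   # fields still hold from-space indices
        return forward[i]

    new_roots = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):   # objects between scan and the copy pointer are grey
        obj = to_space[scan]
        for k, v in enumerate(obj):
            if v is not None:
                obj[k] = copy(v)  # copy or look up, then rewrite the field
        scan += 1
    return to_space, new_roots

# Hypothetical from-space: 0 and 2 form a cycle, 1 is garbage, 3 is shared.
from_space = [[2, 3], [0], [0, 3], [None]]
to_space, new_roots = cheney_copy(from_space, roots=[0])
```

The to-space itself serves as the work queue, so no separate mark stack is needed, and the garbage object at index 1 is simply never copied.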
Concurrent Copying Compaction
Allow mutators to operate during the combined mark-and-compact phase:
1. New objects are allocated in TO space
2. When a pointer to FROM space is read (read barrier), the object it points to is immediately copied to TO space before the pointer is returned to the mutator
Per-pointer read barriers are very expensive, but can be approximated at VM page level
Mostly Concurrent Compaction
After copying objects, resume application threads without fixing up all references, but mark all heap pages as inaccessible
When a thread accesses an inaccessible page, it traps and fixes up all references in that page
Additionally, a background fixer slowly fixes up pages
Field Copy in CoCo
Each field is copied to TOSPACE, and its status is then updated with a lock-free CAS that covers both the field status and the field value
RETRY: // attempt to copy field i to TOSPACE
field_value = *(((unsigned*)wide_object) + 2*i + 1);
*(((unsigned*)to_object) + i) = field_value;
if (!CAS(
        ((unsigned*)wide_object) + 2*i,
        MAKE64FROM32S(STATUS_COPIED, field_value),
        MAKE64FROM32S(STATUS_COPYPENDING, field_value)
    )) goto RETRY;
Azul Systems: Phase 1
Vega (2005): a custom chip and OS to run JVM with a fully concurrent collector
1-cycle hardware instruction for read barrier
Fast user-mode traps for GC-protected pages (enter and exit in 4-10 cycles)
Some instructions are marked as safepoints and check a per-CPU safepoint interrupt flag
Azul Vega 3: up to 864 cores and 768GB of RAM
Azul Systems: Phase 2
Zing (2010): an enhanced Linux kernel to run JVM with a fully concurrent collector on x86-64 hardware
Software LVB (Loaded Value Barrier) + self-healing
Old and young gen collections are concurrent and simultaneous with special “remembered sets” for both old-to-young and young-to-old refs
New virtual memory subsystem to support super-high memory remapping rates
Azul’s Pauseless GC
Mark phase:
Parallel mark that sets a bit on each object reference to indicate it was marked through
Gathers liveness total for each 1MB page
New objects are created in untouchable pages
Relocate phase:
Sparse pages are protected from mutator access
Objects from sparse pages are moved, forwarding information maintained outside the page
The physical page is immediately recycled, the virtual page remains protected until the remap phase
Azul’s Pauseless GC
Remap phase:
Touches each live ref with a read barrier
Virtual memory for the previously protected pages is freed
Folded together with the next Mark phase!
Self-healing:
The read barrier fixes the reference (with a CAS) if the target has moved
The read barrier takes an NMT-trap if the NMT (Not Marked Through) bit for a ref is wrong, makes sure the Mark phase is aware of that ref, and uses a CAS to replace the object ref
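The self-healing behavior of the read barrier can be illustrated with a toy heap. The sketch below is single-threaded, so a plain store stands in for the CAS, and it models only relocation (not the NMT bit); the `Heap` class, its fields, and the addresses used are all hypothetical.

```python
class Heap:
    def __init__(self):
        self.objects = {}        # address -> dict of fields
        self.forwarding = {}     # old address -> new address (kept outside the page)
        self.barrier_fixups = 0

    def relocate(self, old, new):
        """Move an object and record where it went."""
        self.objects[new] = self.objects.pop(old)
        self.forwarding[old] = new

    def load_ref(self, holder, field):
        """Read barrier: return holder.field, healing it if the target moved."""
        ref = self.objects[holder][field]
        if ref in self.forwarding:
            ref = self.forwarding[ref]
            self.objects[holder][field] = ref    # self-healing: fix the source slot
            self.barrier_fixups += 1
        return ref

heap = Heap()
heap.objects = {100: {"next": 200}, 200: {"value": 42}}
heap.relocate(200, 300)
first = heap.load_ref(100, "next")    # slow path: follows forwarding, heals the field
second = heap.load_ref(100, "next")   # fast path: the field already points at 300
```

Because the barrier repairs the slot it loaded from, each stale reference is fixed at most once; later loads of the same field take the fast path.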
What’s The Biggest Innovation Here?
The page remap/protection logic!
Compaction does not require an additional semispace, unlike classic copying collectors
Uses physical memory released by a compacted page as the compaction target for the next page
The very rapid remap/protection rates require either custom hardware or memory mapping extensions
Jumping the gap from ~5GB remaps/sec to ~5TB remaps/sec
Parallel Work Distribution
Mark
Overpartition the roots into more chunks than threads
Threads push new outgoing references to their local queues, other threads can help by stealing
Copy/Compact
Each thread has a copy finger pointing to a relatively large private area in TOSPACE to reduce contention
Forwarding pointer updated using a CAS operation while multiple threads speculatively allocate space for it in their private areas
Generations
When measuring GC efficiency, we look at the ratio Bytes Freed/Time Elapsed
In a multi-gigabyte stable heap, this metric can be very bad, e.g. 1MB/5sec
80-98% of new objects die within 1M instructions or 1MB of allocations
A large fraction of the ones that survive 1-2 collections survive many collections
Dividing The Heap
Introduce multiple memory regions for generations of objects
.NET currently uses three generations
Gen 0 for the newest objects, gen 2 for the oldest
Typical .NET gen 0 budget: 1MB-16MB
On most GC runs, consider only gen 0 objects for the white set and do not traverse grey objects from higher generations
Typical stats: 1GB small allocs/sec, 0.5% GC time, average gen 0 GC latency: 200μs
Additional Tuning
Small allocations from the younger generation(s) can be performed from thread-local areas (TLAs) to reduce contention
Large allocations that don’t fit or are deemed too expensive to copy can be satisfied directly from the older generation
.NET will never compact the large object heap without explicit instruction
Inter-Generational Relations
This design breaks if objects from gen 1/2 reference objects from gen 0
Write barrier to perform a.f = b:
// assume that a is in ECX, b is in EBX
cmp     ecx, dword ptr [gEndOfGeneration0]
jg      SKIP
mov     edx, ecx
shr     edx, 10                             // card index: one card per 1024 bytes
xor     eax, eax
cmpxchg 1, byte ptr [pCardTable + edx]      // set the card to 1 if it was 0
SKIP:
mov     dword ptr [ecx + OFFSET(f)], ebx
Card table:
Range          Has ref to gen 0?
0 – 1023       No
1024 – 2047    Yes
…              No
…              Yes
…              No
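The card table can be modeled as a byte array indexed by address divided by the card size. The sketch below is a simplified illustration: unlike the assembly on the slide, it checks both the source and the target generation, and it assumes a layout where gen 0 occupies the top of the address range. `CardTable`, `gen0_start`, and the addresses used are hypothetical.

```python
CARD_SHIFT = 10                          # 2**10 = 1024-byte cards, matching the ranges above

class CardTable:
    def __init__(self, heap_size, gen0_start):
        self.cards = bytearray((heap_size >> CARD_SHIFT) + 1)
        self.gen0_start = gen0_start     # assumption: gen 0 occupies [gen0_start, heap_size)

    def in_gen0(self, addr):
        return addr >= self.gen0_start

    def write_barrier(self, a_addr, b_addr):
        """Called for a.f = b: remember old-to-young references."""
        if not self.in_gen0(a_addr) and self.in_gen0(b_addr):
            self.cards[a_addr >> CARD_SHIFT] = 1

    def dirty_ranges(self):
        """Address ranges a gen 0 collection must scan for extra roots."""
        return [(i << CARD_SHIFT, ((i + 1) << CARD_SHIFT) - 1)
                for i, dirty in enumerate(self.cards) if dirty]

table = CardTable(heap_size=8192, gen0_start=6144)
table.write_barrier(a_addr=1500, b_addr=7000)   # old -> young: card 1 is marked
table.write_barrier(a_addr=300, b_addr=2000)    # old -> old: nothing to record
table.write_barrier(a_addr=6500, b_addr=7000)   # young -> young: nothing to record
ranges = table.dirty_ranges()
```

During a gen 0 collection, only the dirty ranges of the older generations need to be scanned for references into gen 0, instead of the whole heap.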
Young and Old Collections
Gen 0/gen 1 collections are too short to make concurrent, and involve copying
In gen 2 we can now make a tradeoff and occasionally do a blocking compacting GC
We can even allow quick blocking gen 0/gen 1 collections during a gen 2 concurrent collection
Microsoft calls this “background” vs. “foreground” GC
Modern Performance Problems
Excessive paging when doing full GC
Occasional long and unpredictable pauses
In a game, you usually have 16-33ms/frame
Steep performance decline when most of the available memory is live
“We feel so strongly about ARC being the right approach to memory management that we have decided to deprecate Garbage Collection in OS X.” [Apple, WWDC 2012]
Real-Time Garbage Collectors
Primary property: guarantee a certain % utilization for your application in each time period
Java Metronome (IBM WebSphere Real Time)
% java -Xgcpolicy:metronome -Xgc:targetUtilization=80 -Xgc:targetPauseTime=10 realtime_app
Uses a GC thread per processor, running in short quanta based on utilization constraints and available heap space
Java Metronome
Usually acts like a concurrent mark/sweep collector, no copying
Occasionally performs defragmentation with copying in small, time-constrained units
Move is concurrent and relies on a read barrier
Super-optimized with resulting ~4% performance hit
Arraylets: breaking large arrays into fixed-size non-consecutive pieces to reduce scan and copy overhead and fragmentation
Miscellaneous Optimizations
Value/primitive types (stack allocations)
Custom memory pools for specific uses (e.g., Android bitmap pool)
Escape analysis: compile-time transformation from heap allocations to stack allocations
Partial compaction based on information from sweep phase on heap segment utilization
Switching between expensive and cheap read/write barriers depending on GC stage
Comparing Some Modern Collectors
Runtime   Collector    Nursery       Old Gen           Remarks
JVM       ParallelGC   STW Copy      Par. STW M/S/C
JVM       Concurrent   STW Copy      Conc. Mark        STW Cmpct
JVM       optthruput   STW Copy      Conc. Par. Mark   STW Cmpct
CLR       Conc. WKS    STW Copy      Conc. Mark        STW Cmpct
CLR       Server       STW Copy      Conc. Par. Mark   STW Cmpct
JVM       C4 (Azul)    Conc. Cmpct   Conc. Cmpct
Ruby      Rubinius     STW Copy      Conc. M/S         No Cmpct
JVM       G1           STW Copy      Conc. Mark        Takes pause
Finalization
Automatic reclamation of unmanaged resources is somewhat of an afterthought
Associate an object with a finalizer:
class File
{
    private IntPtr handle;
    // ...
    // Guaranteed to be called at some point after the
    // File object is no longer reachable by the program
    ~File()
    {
        NativeMethods.CloseHandle(handle);
    }
}
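For comparison, the same idea can be tried out in Python, where `weakref.finalize` registers a callback that runs after an object becomes unreachable. This is a sketch under stated assumptions: `close_handle` is a stand-in for a native call such as `CloseHandle`, the handle value is made up, and the immediate cleanup on `del` is CPython behavior rather than a language guarantee.

```python
import gc
import weakref

closed_handles = []

def close_handle(handle):
    """Stand-in for a native CloseHandle call (hypothetical)."""
    closed_handles.append(handle)

class File:
    def __init__(self, handle):
        self.handle = handle
        # The callback receives the handle, not self, so it does not keep the File alive.
        self._finalizer = weakref.finalize(self, close_handle, handle)

f = File(handle=7)
closed_before = list(closed_handles)
del f                 # drop the last reference
gc.collect()          # not needed in CPython here, but harmless on other runtimes
closed_after = list(closed_handles)
```

As with the C# finalizer, the cleanup code runs only after the object is unreachable, so it must not depend on the object itself still being usable.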
References
Uniprocessor Garbage Collection Techniques [Wilson]
Mostly Concurrent Compaction for Mark-Sweep GC [Ossia et al.]
Concurrent Garbage Collection in Rubinius [Bussink]
Parallel Garbage Collection for Shared Memory Multiprocessors [Flood et al.]
Quantifying the Performance of Garbage Collection vs. Explicit Memory Management [Hertz et al.]
Why mobile web apps are slow [Crawford]
STOPLESS: A Real-Time Garbage Collector for Multiprocessors [Pizlo et al.]
A real-time garbage collector with low overhead and consistent utilization [Bacon et al.]
C4: The Continuously Concurrent Compacting Collector [Tene et al.]
The Pauseless GC Algorithm [Click et al.]