A History of Modern Garbage Collection Techniques

In this session we cover the variety of garbage collection algorithms, with a strong focus on tracing garbage collectors. We discuss concurrent and parallel GC, and novel approaches such as Azul's Continuously Concurrent Compacting Collector (C4) and IBM's Metronome real-time GC.

Published in: Technology


  1. 1. Modern Garbage Collection in Theory and Practice
     Sasha Goldshtein, CTO, Sela Group (@goldshtn, blog.sashag.net)
     © Copyright SELA software & Education Labs Ltd. | 14-18 Baruch Hirsch St Bnei Brak, 51202 Israel | www.selagroup.com
  2. 2. Two-generational parallel garbage collection with a heap compaction phase. Source: funtastica (Flickr) under CC BY-NC-SA 2.0
  3. 3. Agenda
     • Automatic reference counting
     • Tracing garbage collection
     • Reachability and tri-color mark
     • Sweep and compaction
     • Concurrent GC
     • Copying and generational GC
     • Parallel GC
     • Realtime GC
     • Finalization
  4. 4. Taxonomy RC M/S/C Concurrent Parallel Generations Miscellanea
     Garbage Collection
     • Automatic memory management
     • In broadest terms: no need to manually reclaim memory that is no longer used
  5. 5. Taxonomy of Garbage Collectors
     Reference counting
     • Automatic (iOS, Python)
     • Library-based (C++, COM)
     Tracing
     • Mark-sweep
     • Copying (Lisp)
     • Generational (Java, .NET, Ruby)
  6. 6. Library-Based Reference Counting
        // This is super-simplified and incomplete
        template <typename T>
        class my_shared_ptr {
          struct rep { T* p_; unsigned int rc_; } * rep_;  // control block shared by all owners
        public:
          my_shared_ptr(T* p) : rep_(new rep { p, 1 }) {}
          my_shared_ptr(const my_shared_ptr& other) {
            rep_ = other.rep_;
            ++(rep_->rc_);                                 // one more owner
          }
          my_shared_ptr& operator=(const my_shared_ptr& other) ...
          T& operator*() { return *(rep_->p_); }
          T* operator->() { return rep_->p_; }
          ~my_shared_ptr() { if (0 == --(rep_->rc_)) ... } // last owner cleans up
        };
  7. 7. Automatic Reference Counting
     • Used by Apple since iOS 4 with the LLVM Objective-C compiler
        -(NSString *)foo:(NSString *)str {
          NSString *temp = [str lowercaseString];
          _field = temp;
          return temp;
        }

        [obj foo:str];
  8. 8. Automatic Reference Counting
     • The compiler inserts retain/release messages to maintain the reference count
        -(NSString *)foo:(NSString *)str {
          NSString *temp = [str lowercaseString];
          _field = [temp retain];
          [str release];
          return temp;
        }

        [obj foo:[str retain]];
  9. 9. Deferred Reference Counting
     • To decrease the overhead, stack variables pointing to objects do not increase their reference count
     • When GC runs, stack variables must be traversed
     • Not a huge improvement
  10. 10. Problems with Reference Counting
     • Performance overhead and contention associated with modifying the reference count
        • Particularly bad for multi-core cache lines
        • Particularly bad for short-lived local variables
     • Cyclic structures cannot be released
        • Python provides an additional GC for reclaiming cycles
        • Objective-C provides a __weak keyword for declaring pointers without retain
     • The cost of memory reclamation is proportional to the number of objects allocated by the program
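
The cycle problem can be shown with a toy model. This is a minimal sketch, not how any real runtime stores counts: `Obj`, `retain`, `release` and `link` are hypothetical names made up for the illustration.

```python
class Obj:
    """Toy heap object with an explicit reference count (hypothetical model)."""
    def __init__(self, name):
        self.name = name
        self.rc = 0          # number of incoming strong references
        self.fields = []     # outgoing strong references
        self.freed = False

def retain(o):
    o.rc += 1
    return o

def release(o):
    o.rc -= 1
    if o.rc == 0:
        o.freed = True
        for child in o.fields:   # freeing an object releases what it points to
            release(child)

def link(src, dst):
    src.fields.append(retain(dst))

# Acyclic chain: dropping the root reference frees everything.
a, b = Obj("a"), Obj("b")
retain(a)                # a root (e.g. a local variable) holds a
link(a, b)
release(a)
chain_freed = (a.freed, b.freed)

# Cycle: x and y keep each other alive after the root lets go.
x, y = Obj("x"), Obj("y")
retain(x)
link(x, y)
link(y, x)
release(x)
cycle_freed = (x.freed, y.freed)
```

In this model the chain is fully reclaimed, while `x` and `y` are left with a count of 1 each and are never freed. A weak back-reference or a separate cycle collector is what breaks the tie.
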
  11. 11. Tracing Garbage Collection
     • Does not maintain a reference count for each object
     • Instead, the garbage collector kicks in at unspecified times and reclaims unused objects
  12. 12. Reachability
     • An object is reachable if it is directly pointed to by a root:
        • Active local variables and parameters on the stack
        • CPU registers
        • Static variables
     • An object is also reachable if it is referenced by a reachable object
  13. 13. How Are We Doing?
     • This notion of reachability allows us to reclaim more memory and do it faster than reference counting!
        // C++
        void f() {
          auto a = make_shared<A>(...);
          a->work();
          // 100 lines that
          // don’t use a
        }   // a dies here

        // C#
        void f() {
          var a = new A(...);
          a.Work();   // after this last use, a can be collected
          // 100 lines that
          // don’t use a
        }
  14. 14. Side Effect
     • In some situations, an object may be collected while a method on it is still executing
        class A {
          public void foo() {
            Thread.Sleep(1000);
            // does not use this
          }
          ~A() {
            // called to reclaim unmanaged resources held by A
            // only after the object is deemed unreachable
          }
        }
  15. 15. Runtime Information
     Precise reachability analysis requires a few things C/C++ can’t easily do:
     1. Size and field information for each heap object
     2. Type information for stack locations and CPU registers
     3. No “pointer smuggling”
     4. Only one address per object
     …which is why most GCs ship with “managed” languages
  16. 16. Finding Unused Objects: “Two-Color” Mark
     • Begin from the set of roots and traverse the heap graph recursively
     • Naïve, two-phase memory traversal
        TRAVERSE(o)
          if marked(o) = TRUE return
          marked(o) ← TRUE
          for v in o’s fields do
            TRAVERSE(v)
          end
        end

        for r in ROOTS do
          TRAVERSE(*r)
        end
        walk the heap linearly and reclaim all objects not marked
  17. 17. Finding Unused Objects: Tri-Color Mark
     • White objects are candidates for collection
     • Black objects were proven to be reachable
     • Grey objects are known to be reachable but have not been traversed yet
        GREY ← { g : ∃r ∈ ROOTS, r → g }
        WHITE ← all objects \ GREY
        BLACK ← ∅
        for g ∈ GREY do
          BLACK ← BLACK ∪ { g }
          GREY ← (GREY ∪ { w ∈ WHITE : g → w }) \ { g }
          WHITE ← WHITE \ GREY
        end
        reclaim all WHITE objects
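
The tri-color loop can be sketched with an explicit worklist. This is an illustrative sketch only: the heap is assumed to be a plain dictionary from object name to the names it references, and `tricolor_mark` is a hypothetical helper, not part of any runtime.

```python
def tricolor_mark(heap, roots):
    """heap maps object -> list of referenced objects; returns (black, white)."""
    white = set(heap)            # every object starts as a collection candidate
    grey = []                    # reachable, but fields not scanned yet
    for r in roots:
        if r in white:
            white.discard(r)
            grey.append(r)
    black = set()                # reachable and fully scanned
    while grey:
        g = grey.pop()
        black.add(g)
        for w in heap[g]:
            if w in white:       # first time we see w: white -> grey
                white.discard(w)
                grey.append(w)
    return black, white

# A, B, C form a reachable cycle; D and E form an unreachable cycle.
heap = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["E"], "E": ["D"]}
black, white = tricolor_mark(heap, roots=["A"])
```

Unlike reference counting, the unreachable cycle D and E ends up in the white set and can be reclaimed.
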
  18. 18. Stop The World?
     • Naïve garbage collectors suspend all mutator threads during a collection
     • With multi-gigabyte heaps this creates unacceptable multi-second pauses
     • Modern tracing collectors suspend mutator threads selectively, only when necessary
     • Importantly, threads executing code outside the runtime don’t need to be suspended as long as they remain outside the runtime
  19. 19. How To Suspend Threads Safely? (1)
     • The compiler occasionally inserts suspension request checks, known as safe points
        for (int k = 0; k < 100000; ++k) {
          LengthyCalculation(k);
          mov byte ptr [pGuardPage], 0    // inserted safe point: a write to the guard page
        }
  20. 20. How To Suspend Threads Safely? (2)
        LONG GlobalExceptionHandler(PEXCEPTION_POINTERS exc)
        {
          // For an access violation, ExceptionInformation[1] holds the faulting address
          if (exc->ExceptionRecord->ExceptionCode == EXCEPTION_ACCESS_VIOLATION &&
              exc->ExceptionRecord->ExceptionInformation[1] == (ULONG_PTR)pGuardPage)
          {
            SetEvent(hThreadEvents[nThisThread]);     // tell the GC this thread is parked
            WaitForSingleObject(hResumeEvent, ...);   // wait until the GC lets us go
            return EXCEPTION_CONTINUE_EXECUTION;
          }
          return EXCEPTION_CONTINUE_SEARCH;
        }
  21. 21. How To Suspend Threads Safely? (3)
     • To suspend all threads safely, mark the page as read-only and wait for all threads to pause when they hit the exception
        void SuspendAllThreads()
        {
          ResetEvent(hResumeEvent);
          VirtualProtect(pGuardPage, PAGE_READONLY, ...);
          WaitForMultipleObjects(..., hThreadEvents, ...);
          VirtualProtect(pGuardPage, PAGE_READWRITE, ...);
          // the stop-the-world GC work would run here, before threads are resumed
          SetEvent(hResumeEvent);
        }
  22. 22. On The Fly Updates?
     • The tri-color states can be updated on-the-fly, without pausing the mutator threads
     • Newly allocated objects are marked black*
     • When a.f = b executes (write barrier):
        • If b is white and a is black, mark b as grey
        • Otherwise, do nothing
     • Occasional short pauses are still necessary to clear the grey set and reach a consistent state
     * Strictly speaking, this is optional, because a new object should survive only if it is assigned to some field/root in a black/grey object
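
Why the barrier matters can be seen in a small simulation. This is a sketch under simplifying assumptions: marking and mutation are interleaved by hand rather than run on real threads, and `Node`, `write_field`, `drain` and `scenario` are hypothetical names.

```python
WHITE, GREY, BLACK = "white", "grey", "black"

class Node:
    def __init__(self, name):
        self.name = name
        self.color = WHITE
        self.fields = {}

def write_field(a, field, b, barrier=True):
    """Mutator store a.field = b, optionally guarded by the write barrier."""
    if barrier and b is not None and a.color == BLACK and b.color == WHITE:
        b.color = GREY           # keep b visible to the collector
    a.fields[field] = b

def drain(nodes):
    """Collector: scan grey nodes until none remain."""
    progress = True
    while progress:
        progress = False
        for n in nodes:
            if n.color == GREY:
                for child in n.fields.values():
                    if child is not None and child.color == WHITE:
                        child.color = GREY
                n.color = BLACK
                progress = True

def scenario(barrier):
    a, g, c = Node("a"), Node("g"), Node("c")
    a.color = BLACK                      # already scanned by the collector
    g.color = GREY                       # discovered, not scanned yet
    g.fields["f"] = c                    # so far c is reachable only through g
    write_field(a, "f", c, barrier)      # mutator: a.f = c
    write_field(g, "f", None, barrier)   # mutator: g.f = null
    drain([a, g, c])
    return c.color
```

With the barrier, `c` is greyed when it is stored into the black object and ends up black. Without it, `c` stays white even though `a.f` still points to it, so it would be reclaimed while in use.
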
  23. 23. Reclaiming Memory
     Mark-and-sweep
     • All unreachable objects are added to a free list
     • Often the list is threaded through the object headers
     [Diagram: heap cells U F U U F F U U U F; legend: Reachable, Free]
     Mark-and-compact
     • Reachable objects are shifted together in memory
     • Makes heap allocation super-cheap: no free list!
  24. 24. Baker’s Implicit Collection
     • Objects have two linked list fields and a color bit
     • To allocate, take an object from the free list
     • When the free list is empty, move all live objects to a separate list and make all other objects free implicitly
     • Doesn’t handle fragmentation but very cheap for large objects (no copying) and doesn’t require reference updates
     [Diagram: linked objects, each with a 0/1 color bit]
  25. 25. Compaction
     • Perfect compaction: requires multiple heap traversals, expensive
     • Two-pointer compaction: cheaper, imperfect
        ptr1 ← beginning of heap
        ptr2 ← end of heap
        while ptr1 < ptr2 do
          repeat ptr2 ← prev(ptr2) until ptr2 ∊ BLACK
          repeat ptr1 ← next(ptr1) until ptr1 ∊ WHITE
          if size(*ptr2) ≤ size(*ptr1) then
            copy(ptr1, ptr2)    // move the live object at ptr2 into the free slot at ptr1
          end
        end
     [Diagram: heap with ptr1 at the start and ptr2 at the end]
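
The two-pointer idea is easiest to see when every cell has the same size, which removes the size check. The sketch below makes that assumption; the heap is modelled as a Python list with `None` for free cells, and `two_finger_compact` is a hypothetical name.

```python
def two_finger_compact(cells):
    """cells: list where None marks a free slot.
    Returns (compacted cells, forwarding table old index -> new index)."""
    cells = list(cells)
    forwarding = {}
    free, live = 0, len(cells) - 1
    while True:
        while free < live and cells[free] is not None:
            free += 1                    # advance to the next free slot
        while free < live and cells[live] is None:
            live -= 1                    # retreat to the last live object
        if free >= live:
            break
        cells[free] = cells[live]        # slide the live object down
        cells[live] = None
        forwarding[live] = free          # needed later to fix up references
    return cells, forwarding

compacted, fwd = two_finger_compact(["a", None, "b", None, None, "c", "d"])
```

Note that the result does not preserve allocation order (`d` ends up before `b`), and every reference to a moved object still has to be updated from the forwarding table.
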
  26. 26. Compaction By Copying
     • Two semi-spaces: FROM and TO
     • Allocations satisfied in FROM space
     • GC copies all surviving objects to the TO space and swaps the meaning of each space
     [Diagram: live objects U (1) to U (4), interleaved with free cells in FROM, are copied next to each other into TO during GC; the spaces then swap roles]
  27. 27. Compaction Procedure
     • Compaction requires updating references
     • Can be combined with stop-the-world and marking
        stack ← ROOTS
        for g in stack do
          place forwarding pointer in g’s old location
          move g to new location
          for v in g’s fields do
            if v was forwarded then update g.v
            else push v to stack
          end
        end
  28. 28. Two-Finger Copy (Cheney)
        copyptr, scanptr ← start of TO
        for r in ROOTS do
          if *r was forwarded then r ← (*r).fwdptr
          else copy(r)
        end
        while scanptr < copyptr do
          for v in FIELDS(scanptr) do
            if v is in FROM and v was not forwarded then copy(v)
            else if v was forwarded then v ← (*v).fwdptr
          end
          scanptr ← scanptr + size(*scanptr)
        end

        procedure copy(p)
          memcpy(copyptr, p, size(*p))
          p, (*p).fwdptr ← copyptr
          copyptr ← copyptr + size(*p)
        end
     [Diagram: TO space with scanptr behind copyptr; everything past copyptr is FREE]
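
Cheney's scan can be sketched on a toy heap. Assumptions for this sketch: FROM space is a dictionary from address to a list of field values (addresses or `None`), TO space is a Python list whose indices act as the new addresses, and `cheney_collect` is a hypothetical name.

```python
def cheney_collect(from_space, roots):
    """Copy everything reachable from roots into a fresh TO space.
    Returns (to_space, new_roots); TO addresses are list indices."""
    to_space = []        # copyptr is implicitly len(to_space)
    forwarding = {}      # FROM address -> TO address

    def copy(addr):
        if addr is None:
            return None
        if addr not in forwarding:
            forwarding[addr] = len(to_space)
            # Fields are copied verbatim and still point into FROM for now.
            to_space.append(list(from_space[addr]))
        return forwarding[addr]

    new_roots = [copy(r) for r in roots]
    scan = 0             # scanptr: objects before it have fixed-up fields
    while scan < len(to_space):
        obj = to_space[scan]
        for i, field in enumerate(obj):
            obj[i] = copy(field)     # copy the target if needed, then fix the field
        scan += 1
    return to_space, new_roots

# 10 -> 20, 30; 20 -> 10 (cycle); 40 is unreachable garbage.
from_space = {10: [20, 30], 20: [10], 30: [None], 40: [10]}
to_space, new_roots = cheney_collect(from_space, roots=[10])
```

The TO space itself serves as the work queue, so no separate mark stack is needed, and the unreachable object at address 40 is simply never copied.
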
  29. 29. Concurrent Copying Compaction
     Allow mutators to operate during the combined mark-and-compact phase:
     1. New objects are allocated in TO space
     2. When a pointer to FROM space is read (read barrier), the object it refers to is immediately copied to TO space before the pointer is returned to the mutator
     Per-pointer read barriers are super-expensive, but can be approximated at VM page level
  30. 30. Mostly Concurrent Compaction
     • After copying objects, resume application threads without fixing up all references but mark all heap pages as inaccessible
     • When a thread accesses an inaccessible page, it traps and fixes up all references in that page
     • Additionally, a background fixer slowly fixes up pages
     [Diagram: heap pages; legend: Not fixed up, Fixed up, Free]
  31. 31. Fully Concurrent Compaction (CoCo)
     • Move objects in two phases using lock-free per-field updates to let mutators see the up-to-date state
     [Diagram: FROMSPACE object (Hdr, Field 1, Field 2) → wide object (Hdr, Fld 1 Status, Field 1, Fld 2 Status, Field 2) → TOSPACE object (Hdr, Field 1, Field 2)]
  32. 32. Field Copy in CoCo
     • Each field is copied and its status field is then updated in a lock-free CAS operation on both the field status and the field
        RETRY:
          // attempt to copy field i to TOSPACE
          field_value = *(((unsigned*)wide_object) + 2*i + 1);
          *(((unsigned*)to_object) + i) = field_value;
          if (!CAS(
                ((unsigned*)wide_object) + 2*i,
                MAKE64FROM32S(STATUS_COPIED, field_value),
                MAKE64FROM32S(STATUS_COPYPENDING, field_value)
             ))
            goto RETRY;
  33. 33. Azul Systems: Phase 1
     • Vega (2005): a custom chip and OS to run JVM with a fully concurrent collector
     • 1-cycle hardware instruction for read barrier
     • Fast user-mode traps for GC-protected pages (enter and exit in 4-10 cycles)
     • Some instructions are marked as safepoints and check a per-CPU safepoint interrupt flag
     [Photo: Azul Vega 3] Up to 864 cores and 768GB of RAM
  34. 34. Azul Systems: Phase 2
     • Zing (2010): an enhanced Linux kernel to run JVM with a fully concurrent collector on x86-64 hardware
     • Software LVB (Loaded Value Barrier) + self-healing
     • Old and young gen collections are concurrent and simultaneous with special “remembered sets” for both old-to-young and young-to-old refs
     • New virtual memory subsystem to support super-high memory remapping rates
  35. 35. Azul’s Pauseless GC
     Mark phase:
     • Parallel mark that sets a bit on each object reference to indicate it was marked through
     • Gathers liveness total for each 1MB page
     • New objects are created in untouchable pages
     Relocate phase:
     • Sparse pages are protected from mutator access
     • Objects from sparse pages are moved, forwarding information maintained outside the page
     • The physical page is immediately recycled, the virtual page remains protected until the remap phase
  36. 36. Azul Page States
     [State diagram; labels as extracted: Allocating (RW), Page full, Allocated (RW), Compacting, Compacted, New alloc, Free. Virtual page: Relocating (Protected), Relocated, Unmapped, Free. Physical page: Relocated (Protected), Physical free]
  37. 37. Azul’s Pauseless GC
     Remap phase:
     • Touches each live ref with a read barrier
     • Virtual memory for the previously protected pages is freed
     • Folded together with the next Mark phase!
     Self-healing:
     • The read barrier fixes the reference (with a CAS) if the target has moved
     • The read barrier takes an NMT-trap if the NMT bit for a ref is wrong, makes sure the Mark phase is aware of that ref, and uses a CAS to replace the object ref
  38. 38. What’s The Biggest Innovation Here?
     • The page remap/protection logic!
     • Compaction does not require an additional semispace, unlike classic copying collectors
     • Uses physical memory released by a compacted page as the compaction target for the next page
     • The very rapid remap/protection rates require either custom hardware or memory mapping extensions
     • Jumping the gap from ~5GB remaps/sec to ~5TB remaps/sec
  39. 39. Parallel GC
     • In a nutshell: perform the work in parallel
     • In practical terms: how to divide the work?
     • Work-stealing queues, almost no locks
  40. 40. Parallel Work Distribution
     Mark
     • Overpartition the roots into more chunks than threads
     • Threads push new outgoing references to their local queues, other threads can help by stealing
     Copy/Compact
     • Each thread has a copy finger pointing to a relatively large private area in TOSPACE to reduce contention
     • Forwarding pointer updated using a CAS operation while multiple threads speculatively allocate space for it in their private areas
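
The work distribution for the mark phase can be imitated without real threads. The sketch below is a single-threaded, round-robin simulation: each "worker" owns a deque, pops from its own tail, and steals from the head of another worker's deque when it runs dry. `parallel_mark` is a hypothetical name, and a real collector would need atomic mark bits and lock-free deques.

```python
from collections import deque

def parallel_mark(heap, roots, n_workers=2):
    """Round-robin simulation of work-stealing marking.
    Returns (marked objects, number of steals)."""
    queues = [deque() for _ in range(n_workers)]
    for i, r in enumerate(roots):            # spread the roots across workers
        queues[i % n_workers].append(r)
    marked, steals = set(), 0
    while any(queues):
        for q in queues:
            if q:
                obj = q.pop()                # owner takes its newest item
            else:
                victims = [v for v in queues if v]
                if not victims:
                    continue
                obj = victims[0].popleft()   # thief takes the victim's oldest item
                steals += 1
            if obj in marked:
                continue
            marked.add(obj)
            q.extend(heap[obj])              # push outgoing references locally
    return marked, steals

heap = {"r": ["a", "b", "c"], "a": [], "b": ["d"], "c": [], "d": [], "x": []}
marked, steals = parallel_mark(heap, roots=["r"], n_workers=2)
```

With a single root, the second worker has nothing to do until it steals, which is exactly the imbalance that overpartitioning the roots tries to avoid.
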
  41. 41. Generations
     • When measuring GC efficiency, we look at the ratio Bytes Freed / Time Elapsed
     • In a multi-gigabyte stable heap, this metric can be very bad, e.g. 1MB/5sec
     • 80-98% of new objects die within 1M instructions or 1MB of allocations
     • A large fraction of the ones that survive 1-2 collections survive many collections
  42. 42. Dividing The Heap
     • Introduce multiple memory regions for generations of objects
     • .NET currently uses three generations
        • Gen 0 for the newest objects, gen 2 for the oldest
        • Typical .NET gen 0 budget: 1MB-16MB
     • On most GC runs, consider only gen 0 objects for the white set and do not traverse grey objects from higher generations
     • Typical stats: 1GB small allocs/sec, 0.5% GC time, average gen 0 GC latency: 200μs
     [Diagram: heap divided into Gen 2, Gen 1, Gen 0]
  43. 43. Additional Tuning
     • Small allocations from the younger generation(s) can be performed from thread-local areas (TLAs) to reduce contention
     • Large allocations that don’t fit or are deemed too expensive to copy can be satisfied directly from the older generation
     • .NET will never compact the large object heap without explicit instruction
  44. 44. Inter-Generational Relations
     • This design breaks if objects from gen 1/2 reference objects from gen 0
     • Write barrier to perform a.f = b:
        // assume that a is in ECX, b is in EBX
        cmp ecx, dword ptr [gEndOfGeneration0]
        jg SKIP
        mov edx, ecx
        shr edx, 10                               // card index: one card per 1024 bytes
        xor eax, eax
        cmpxchg 1, byte ptr [pCardTable + edx]    // schematic: set the card byte to 1 if it is still 0
        SKIP:
        mov dword ptr [ecx + OFFSET(f)], ebx      // the actual store a.f = b
     Card table:
        Range        | Has ref to gen 0?
        0 – 1023     | No
        1024 – 2047  | Yes
        …            | No
        …            | Yes
        …            | No
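
A card table can be sketched in a few lines. Assumptions for this sketch: addresses are plain integers, gen 0 occupies all addresses at or above a hypothetical `gen0_start`, and one card covers 1024 bytes (matching the shift by 10 in the barrier). The `CardTable` class and its methods are illustrative names only.

```python
CARD_SHIFT = 10   # 1024-byte cards

class CardTable:
    def __init__(self, heap_size, gen0_start):
        self.cards = bytearray((heap_size >> CARD_SHIFT) + 1)
        self.gen0_start = gen0_start     # assumed layout: gen 0 at the top of the heap

    def write_barrier(self, a_addr, b_addr):
        """Record a.f = b when an older object now points into gen 0."""
        if a_addr < self.gen0_start <= b_addr:
            self.cards[a_addr >> CARD_SHIFT] = 1

    def dirty_ranges(self):
        """Address ranges a gen 0 collection must scan as extra roots."""
        return [(i << CARD_SHIFT, ((i + 1) << CARD_SHIFT) - 1)
                for i, dirty in enumerate(self.cards) if dirty]

table = CardTable(heap_size=4096, gen0_start=3072)
table.write_barrier(1500, 3100)   # old object -> gen 0 object: card gets dirty
table.write_barrier(100, 200)     # old -> old: nothing to record
table.write_barrier(3200, 3300)   # gen 0 -> gen 0: nothing to record
```

A gen 0 collection then treats only the dirty ranges as additional roots instead of walking all of gen 1 and gen 2.
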
  45. 45. Young and Old Collections
     • Gen 0/gen 1 collections are too short to make concurrent, and involve copying
     • In gen 2 we can now make a tradeoff and occasionally do a blocking compacting GC
     • We can even allow quick blocking gen 0/gen 1 collections during a gen 2 concurrent collection
     • Microsoft calls this “background” vs. “foreground” GC
  46. 46. Modern Performance Problems
     • Excessive paging when doing full GC
     • Occasional long and unpredictable pauses
        • In a game, you usually have 16-33ms/frame
     • Steep performance decline when most of the available memory is live
     “We feel so strongly about ARC being the right approach to memory management that we have decided to deprecate Garbage Collection in OS X.” [Apple, WWDC 2012]
  47. 47. Apple Said WHAT?
     “…as long as their applications will be running on systems equipped with more than three times as much RAM as required, the garbage collection is a reasonable choice.” [Hertz et al.]
  48. 48. Real-Time Garbage Collectors
     • Primary property: guarantee a certain % utilization for your application in each time period
     • Java Metronome (IBM WebSphere Real Time)
        % java -Xgcpolicy:metronome -Xgc:targetUtilization=80 -Xgc:targetPauseTime=10 realtime_app
     • Uses a GC thread per processor, running in short quanta based on utilization constraints and available heap space
  49. 49. Java Metronome
     • Usually acts like a concurrent mark/sweep collector, no copying
     • Occasionally performs defragmentation with copying in small, time-constrained units
     • Move is concurrent and relies on a read barrier
        • Super-optimized with resulting ~4% perf. hit
     • Arraylets: breaking large arrays into fixed-size non-consecutive pieces to reduce scan and copy overhead and fragmentation
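
The arraylet idea can be sketched as an array that stores its elements in fixed-size chunks reached through an index of chunks (called a spine here). This is a toy model, not Metronome's actual layout; `ArrayletArray` and the chunk size of 4 are assumptions made for the example.

```python
class ArrayletArray:
    """Large array stored as fixed-size chunks reached through a spine (toy model)."""
    def __init__(self, length, chunk_size=4):
        self.length = length
        self.chunk_size = chunk_size
        n_chunks = (length + chunk_size - 1) // chunk_size
        # Each arraylet is a separate small allocation; they need not be adjacent.
        self.spine = [[0] * chunk_size for _ in range(n_chunks)]

    def _locate(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        return divmod(index, self.chunk_size)   # (chunk number, offset in chunk)

    def __getitem__(self, index):
        chunk, offset = self._locate(index)
        return self.spine[chunk][offset]

    def __setitem__(self, index, value):
        chunk, offset = self._locate(index)
        self.spine[chunk][offset] = value

arr = ArrayletArray(10, chunk_size=4)
for i in range(10):
    arr[i] = i * i
```

The price is one extra indirection per element access; in return, the collector never has to find or move one huge contiguous block.
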
  50. 50. Miscellaneous Optimizations
     • Value/primitive types (stack allocations)
     • Custom memory pools for specific uses (e.g., Android bitmap pool)
     • Escape analysis: compile-time transformation from heap allocations to stack allocations
     • Partial compaction based on information from sweep phase on heap segment utilization
     • Switching between expensive and cheap read/write barriers depending on GC stage
  51. 51. Comparing Some Modern Collectors
     Runtime | Collector  | Nursery     | Old Gen         | Remarks
     JVM     | ParallelGC | STW Copy    | Par. STW M/S/C  |
     JVM     | Concurrent | STW Copy    | Conc. Mark      | STW Cmpct
     JVM     | optthruput | STW Copy    | Conc. Par. Mark | STW Cmpct
     CLR     | Conc. WKS  | STW Copy    | Conc. Mark      | STW Cmpct
     CLR     | Server     | STW Copy    | Conc. Par. Mark | STW Cmpct
     JVM     | C4 (Azul)  | Conc. Cmpct | Conc. Cmpct     |
     Ruby    | Rubinius   | STW Copy    | Conc. M/S       | No Cmpct
     JVM     | G1         | STW Copy    | Conc. Mark      | Takes pause
  52. 52. Finalization
     • Automatic reclamation of unmanaged resources is somewhat of an afterthought
     • Associate an object with a finalizer:
        class File {
          private IntPtr handle;
          // ...
          ~File() { NativeMethods.CloseHandle(handle); }
        }
     • The finalizer is guaranteed to be called at some point after the File object is no longer reachable by the program
  53. 53. Finalization Details
     • How do we run a method on an object if it is no longer reachable?
     • Which thread is supposed to run that method?
     • How is that thread going to know about it?
     [Diagram: the GC thread moves entries from “Objects that requested finalization” to “Objects waiting for finalization”, which the finalizer thread drains]
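
One common answer to these questions is a pair of queues shared between the collector and a dedicated finalizer thread. The sketch below models that hand-off without real threads; `FinalizationQueues` and its method names are hypothetical, and objects are represented by plain strings.

```python
from collections import deque

class FinalizationQueues:
    def __init__(self):
        self.registered = set()   # objects that requested finalization
        self.ready = deque()      # objects waiting for finalization

    def register(self, obj):
        self.registered.add(obj)

    def on_gc(self, reachable):
        """GC thread: queue unreachable finalizable objects instead of freeing them."""
        dead = sorted(self.registered - set(reachable))
        for obj in dead:
            self.registered.discard(obj)
            self.ready.append(obj)    # the queue keeps obj alive for now
        return dead

    def run_finalizers(self, finalize):
        """Finalizer thread: drain the queue; memory is reclaimed by a later GC."""
        done = []
        while self.ready:
            obj = self.ready.popleft()
            finalize(obj)
            done.append(obj)
        return done

queues = FinalizationQueues()
for name in ("file1", "file2", "file3"):
    queues.register(name)
closed = []
survivors = queues.on_gc(reachable={"file2"})
finalized = queues.run_finalizers(closed.append)
```

Because the ready queue references the dead objects, they survive the collection that found them dead and can only be reclaimed by a later one, which is where the extra cost of finalizable objects comes from.
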
  54. 54. Finalization Performance
     • Performance issues with Java finalizers [Weimer]
     • What happens here, assuming that the Statement class has registered a finalizer?
        for (int i = 0; i < 10000000; ++i) {
          Statement st = db.prepare("SELECT 1");
          st.close();
        }
  55. 55. Finalization Performance
     • Allocation rate of finalizable objects must be throttled
     • Finalizable objects survive longer, which is bad for generational collectors
     • Code execution is complicated by the introduction of an additional thread (memory races, deadlocks)
     • Deterministic disposal is often necessary for frequently-used objects
  56. 56. Summary
     • Is garbage collection a good idea?
     • Should we all switch to Java?
     • Should we all switch to static char mem[10000000];?
     • There is no clear-cut answer
     • There is progress
     • There are tradeoffs
  57. 57. Questions
  58. 58. References
     • Uniprocessor Garbage Collection Techniques [Wilson]
     • Mostly Concurrent Compaction for Mark-Sweep GC [Ossia et al.]
     • Concurrent Garbage Collection in Rubinius [Bussink]
     • Parallel Garbage Collection for Shared Memory Multiprocessors [Flood et al.]
     • Quantifying the Performance of Garbage Collection vs. Explicit Memory Management [Hertz et al.]
     • Why mobile web apps are slow [Crawford]
     • STOPLESS: A Real-Time Garbage Collector for Multiprocessors [Pizlo et al.]
     • A real-time garbage collector with low overhead and consistent utilization [Bacon et al.]
     • C4: The Continuously Concurrent Compacting Collector [Tene et al.]
     • The Pauseless GC Algorithm [Click et al.]
