Lp seminar

  1. On-the-Fly Garbage Collection Using Sliding Views. Erez Petrank, Technion – Israel Institute of Technology. Joint work with Yossi Levanoni, Hezi Azatchi, and Harel Paz.
  2. Garbage Collection
     - The user allocates space dynamically; the garbage collector automatically frees the space when it is "no longer needed".
     - Usually "no longer needed" = unreachable by a path of pointers from the program's local references (roots).
     - The programmer does not have to decide when to free an object (no memory leaks, no dereferencing of freed objects).
     - Built into Java, C#.
  3. Garbage Collection: Two Classic Approaches
     - Reference counting [Collins 1960]: keep a reference count for each object; reclaim objects whose count drops to 0.
     - Tracing [McCarthy 1960]: trace the reachable objects; reclaim objects not traced.
     - Traditional wisdom: tracing is good, reference counting is problematic.
  4. What (Was) Bad about RC?
     - Does not reclaim cycles (e.g., two objects A and B pointing to each other).
     - A heavy overhead on pointer modifications.
     - Traditional belief: "cannot be used efficiently with parallel processing".
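     To make the cycle problem concrete, here is a tiny hedged illustration in C-style pseudocode. new_object, set_field, and drop_local are hypothetical helpers standing in for allocation, the pointer-update barrier, and a local variable going out of scope; the counts assume the naive scheme of the later slides, where local references are counted too.

      /* Objects a and b reference each other.  Once the program drops its
       * last external pointers, each still holds a reference to the other,
       * so neither count ever reaches zero and plain RC never frees them. */
      void cycle_example(void) {
          Object *a = new_object();   /* a.rc == 1 (held by local a)        */
          Object *b = new_object();   /* b.rc == 1                          */
          set_field(a, 0, b);         /* b.rc == 2                          */
          set_field(b, 0, a);         /* a.rc == 2                          */
          drop_local(a);              /* a.rc == 1                          */
          drop_local(b);              /* b.rc == 1: both are garbage, yet   */
      }                               /* both counts stay above zero        */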
  5. What's Good about RC?
     - Reference counting work is proportional to the work on creations and modifications.
       - Can tracing deal with tomorrow's huge heaps?
     - Reference counting has good locality.
     - The challenge:
       - RC overhead on pointer modification seems too expensive.
       - RC seems impossible to "parallelize".
  6. Garbage Collection Today
     - Today's advanced environments: multiprocessors + large memories.
     - Dealing with multiprocessors: single-threaded stop-the-world collection.
  7. Garbage Collection Today
     - Today's advanced environments: multiprocessors + large memories.
     - Dealing with multiprocessors: concurrent collection and parallel collection.
  8. Terminology (stop-the-world, parallel, concurrent, ...)
     - Stop-the-world: the program threads are halted while a single collector thread runs.
     - Parallel (STW): the program threads are halted while several collector threads run in parallel.
     - Concurrent: the collector runs alongside the program, with a short period in which all threads are stopped.
     - On-the-fly: the collector runs alongside the program and never stops all threads together.
  9. Benefits & Costs (Informal)
     - Pause times range roughly from ~200 ms (stop-the-world) down to ~20 ms and ~2 ms for the more concurrent collectors (parallel STW, concurrent, on-the-fly).
     - Throughput loss: 10-20%.
  10. This Talk
     - Introduction: RC and tracing, coping with SMPs.
     - RC introduction and the parallelization problem.
     - Main focus: a novel concurrent reference-counting algorithm (suitable for Java).
     - The concurrent collector made on-the-fly, based on "sliding views".
     - Extensions: cycle collection, mark and sweep, generations, age-oriented.
     - Implementation and measurements on Jikes.
       - Extremely short pauses, good throughput.
  11. Basic Reference Counting
     - Each object has an RC field; new objects get o.rc := 1.
     - When a pointer p that points to o1 is modified to point to o2, execute: o2.rc++, o1.rc--.
     - If o1.rc == 0 then:
       - Delete o1.
       - Decrement o.rc for every child o of o1.
       - Recursively delete objects whose rc is decremented to 0.
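     A minimal C sketch of the scheme on this slide, assuming a simplified object layout with an explicit rc field and an array of pointer fields. The layout and helper names are illustrative, not the deck's actual implementation.

      #include <stdlib.h>

      typedef struct Object Object;
      struct Object {
          long     rc;        /* reference count              */
          Object **fields;    /* the object's pointer fields  */
          int      nfields;
      };

      /* o.rc--; if it drops to 0, delete o, decrement its children,
       * and recursively delete children whose count reaches 0.      */
      static void rc_dec(Object *o) {
          if (o == NULL) return;
          if (--o->rc == 0) {
              for (int i = 0; i < o->nfields; i++)
                  rc_dec(o->fields[i]);
              free(o);
          }
      }

      /* The pointer p is modified from o1 to o2: o2.rc++, o1.rc--. */
      static void rc_update(Object **p, Object *o2) {
          Object *o1 = *p;
          if (o2 != NULL) o2->rc++;
          *p = o2;
          rc_dec(o1);
      }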
  12. An Important Term
     - A write barrier is a piece of code executed with each pointer update.
     - "p ← o2" implies: read p (see o1); p ← o2; o2.rc++; o1.rc--.
  13. Deferred Reference Counting
     - Problem: the overhead of updating counts for program variables (locals) is too high.
     - Solution [Deutsch & Bobrow 76]:
       - Don't update rc for local variables (roots).
       - "Once in a while": collect all objects with o.rc = 0 that are not referenced from local variables.
     - Deferred RC reduces the overhead by 80% and is used in most modern RC systems (a sketch of the scheme follows below).
     - Still, the "heap" write barrier is too costly.
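     A hedged sketch of the Deutsch-Bobrow idea, reusing the Object struct from the sketch above. A zero-count table (ZCT) remembers objects whose heap count dropped to zero; they are checked against the roots later instead of being freed immediately. Root scanning is runtime-specific, so referenced_from_roots is a hypothetical helper.

      #define ZCT_CAPACITY 65536
      static Object *zct[ZCT_CAPACITY];
      static int zct_size = 0;

      /* Heap write barrier only: assignments to locals (roots) do no RC work. */
      static void deferred_heap_update(Object **slot, Object *o2) {
          Object *o1 = *slot;
          if (o2 != NULL) o2->rc++;
          *slot = o2;
          if (o1 != NULL && --o1->rc == 0 && zct_size < ZCT_CAPACITY)
              zct[zct_size++] = o1;     /* defer the decision; do not free yet */
      }

      /* "Once in a while": reclaim ZCT entries not referenced from any root. */
      static void deferred_collect(void) {
          for (int i = 0; i < zct_size; i++) {
              Object *o = zct[i];
              if (o->rc == 0 && !referenced_from_roots(o)) {   /* hypothetical */
                  for (int j = 0; j < o->nfields; j++)
                      rc_dec(o->fields[j]);    /* rc_dec from the sketch above */
                  free(o);
              }
          }
          zct_size = 0;
      }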
  14. Multithreaded RC?
     - Traditional wisdom: the write barrier must be synchronized!
  15. Multithreaded RC?
     - Problem 1: ref-count updates must be atomic.
     - Fortunately, this can be easily solved: each thread logs the required updates in a local buffer, and the collector applies all the updates during GC (as a single thread).
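     A sketch of what such per-thread logging might look like; the buffer layout is hypothetical, the deck only states the idea.

      #define LOG_CAPACITY 4096

      typedef struct {
          Object *inc[LOG_CAPACITY];   /* objects whose rc must be incremented */
          Object *dec[LOG_CAPACITY];   /* objects whose rc must be decremented */
          int ninc, ndec;
      } ThreadLog;

      static _Thread_local ThreadLog my_log;   /* one buffer per mutator thread */

      /* No atomic instructions in the fast path: each thread writes only to
       * its own buffer; the single collector thread applies the rc updates.  */
      static void logged_update(Object **slot, Object *o2) {
          Object *o1 = *slot;
          *slot = o2;
          if (o2 != NULL) my_log.inc[my_log.ninc++] = o2;
          if (o1 != NULL) my_log.dec[my_log.ndec++] = o1;
          /* (buffer-overflow handling omitted for brevity) */
      }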
  16. Multithreaded RC?
     - Problem 1: ref-count updates must be atomic.
     - Problem 2: parallel updates confuse the counters. Example with objects A, B, C, D:
       - Thread 1: read A.next (see B); A.next ← C; B.rc--; C.rc++.
       - Thread 2: read A.next (see B); A.next ← D; B.rc--; D.rc++.
       - B's count is decremented twice although only one pointer to it was removed, and both C.rc and D.rc are incremented although A.next ends up pointing to only one of them.
  17. Known Multithreaded RC
     - [DeTreville 1990, Bacon et al. 2001]:
       - A compare-and-swap for each pointer modification.
       - Each thread records its updates in a buffer.
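     A rough sketch of the kind of synchronized barrier those collectors rely on: a compare-and-swap loop on the slot. log_inc and log_dec stand for the per-thread buffering and are hypothetical, and the cited papers differ in their details.

      #include <stdatomic.h>

      /* The slot is updated with a compare-and-swap, so the old value this
       * thread logs is exactly the value it replaced, even under races.   */
      static void cas_update(Object *_Atomic *slot, Object *o2) {
          Object *o1;
          do {
              o1 = atomic_load(slot);
          } while (!atomic_compare_exchange_weak(slot, &o1, o2));
          log_inc(o2);     /* hypothetical: record rc++ for o2 */
          log_dec(o1);     /* hypothetical: record rc-- for o1 */
      }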
  18. To Summarize the Problems...
     - The write barrier overhead is high, even with deferred RC.
     - Using RC with multithreading seems to carry a high synchronization cost: a lock or compare-and-swap with each pointer update.
  19. Reducing RC Overhead
     - We start by looking at the "parent's point of view".
     - We are counting rc for the child, but rc changes when a parent's pointer is modified.
  20. An Observation
     - Consider a pointer p that takes the following values between GCs: O0, O1, O2, ..., On.
     - All RC algorithms perform 2n operations:
       O0.rc--; O1.rc++; O1.rc--; O2.rc++; O2.rc--; ...; On.rc++;
     - But only two operations are needed: O0.rc-- and On.rc++ (the intermediate increments and decrements cancel out).
  21. Use of the Observation
     - During the program run, only the first modification of each pointer is logged:
       p ← O1 (record p's previous value O0); p ← O2 (do nothing); ...; p ← On (do nothing).
     - At garbage collection time, for each modified slot p:
       - read p to get On, and read the records to get O0;
       - O0.rc--, On.rc++.
  22. Some Technical Remarks
     - When a pointer is first modified, it is marked "dirty" and its previous value is logged.
     - We actually log each object that gets modified (not just the single pointer):
       - Reason 1: we don't want a dirty bit per pointer.
       - Reason 2: an object's pointers tend to be modified together.
     - Only non-null pointer fields are logged.
     - New objects are "born dirty".
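     A hedged sketch of object-granularity logging as described above, assuming the Object struct is extended with a dirty flag and that buffer_append records (object, field index, old value) triples in the current thread's buffer; both names are illustrative.

      /* On the first modification of any pointer field of an object, snapshot
       * all of its non-null pointer fields and mark the object dirty.         */
      static void log_object(Object *obj) {
          for (int i = 0; i < obj->nfields; i++)
              if (obj->fields[i] != NULL)
                  buffer_append(obj, i, obj->fields[i]);   /* hypothetical log call */
          obj->dirty = 1;
      }

      static void object_update(Object *obj, int field, Object *o2) {
          if (!obj->dirty)          /* only the first modification is logged */
              log_object(obj);
          obj->fields[field] = o2;
      }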
  23. Effects of the Optimization
     - RC work is significantly reduced: the number of logging and counter updates drops by a factor of 100-1000 for typical Java benchmarks!
  24. Elimination of RC Updates

      Benchmark    No. of stores    No. of "first" stores    Ratio of "first" stores
      Mpegaudio        5,517,795                       51                   1/108192
      Jess            26,258,107                   27,333                      1/961
      Javac           22,042,028                  535,296                       1/41
      Jack           135,174,775                    1,546                    1/87435
      Db              33,124,780                   30,696                     1/1079
      Compress            64,905                       51                     1/1273
      jbb             71,011,357                  264,115                      1/269
  25. Effects of the Optimization
     - RC work is significantly reduced: the number of logging and counter updates drops by a factor of 100-1000 for typical Java benchmarks!
     - The write barrier overhead is dramatically reduced: the vast majority of write barriers run a single "if".
     - Last but not least: the task has changed! We only need to record the first update of each pointer.
  26. Reducing the Synchronization Overhead
     - Our second contribution: a carefully designed write barrier (and an observation) that does not require any synchronization operation.
  27. The Write Barrier

      Update(Object **slot, Object *new) {
          Object *old = *slot;
          if (!IsDirty(slot)) {
              log(slot, old);
              SetDirty(slot);
          }
          *slot = new;
      }

      Observation: if two threads invoke the write barrier in parallel and both log an old value, then both record the same old value.
  28. Running the Write Barrier Concurrently
      Thread 1 and Thread 2 execute the same barrier on the same slot; the comment holds symmetrically from each thread's point of view:

      Update(Object **slot, Object *new) {
          Object *old = *slot;
          if (!IsDirty(slot)) {
              /* If we got here, the other thread has not yet set the dirty
                 bit and thus has not yet modified the slot, so both threads
                 read (and log) the same old value. */
              log(slot, old);
              SetDirty(slot);
          }
          *slot = new;
      }
  29. The Concurrent Algorithm
     - Use the write barrier with the program threads.
     - To collect (sketched in code below):
       - Stop all threads.
       - Scan the roots (local variables).
       - Get the buffers with the modified slots.
       - Clear all dirty bits.
       - Resume the threads.
       - For each modified slot:
         - decrement rc for the old value (written in the buffer),
         - increment rc for the current value (read from the heap).
       - Reclaim non-local objects with rc 0.
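     A minimal sketch of that collection cycle. Helper names such as stop_all_threads and for_each_logged_slot are hypothetical; this is the concurrent version, before the on-the-fly refinement of the following slides.

      /* Applied per logged slot: the old value comes from the buffer, the
       * current value is read from the heap (or from a newer log entry).  */
      static void apply_update(Object **slot, Object *old_value) {
          Object *current = *slot;
          if (old_value != NULL) old_value->rc--;
          if (current   != NULL) current->rc++;
      }

      static void collect_cycle(void) {
          stop_all_threads();                  /* short synchronous phase       */
          scan_roots();                        /* mark objects referenced from
                                                  thread locals as "local"      */
          take_thread_buffers();               /* buffers of modified slots     */
          clear_all_dirty_bits();
          resume_all_threads();                /* mutators run from here on     */

          for_each_logged_slot(apply_update);  /* hypothetical iterator         */
          reclaim_unreferenced();              /* free non-local objects, rc==0 */
      }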
  30. Timeline: stop threads; scan roots; get buffers; erase dirty bits; resume threads. Then decrement the values in the read buffers, increment the "current" values, and collect dead objects.
  31. Timeline (continued): as above, noting that unmodified current values are read from the heap, while modified ones are found in the new buffers.
  32. The Concurrent Algorithm (revisited)
     - The same collection steps as in slide 29, with two goals for turning the concurrent collector into an on-the-fly one:
       - Goal 1: clear the dirty bits during the program run.
       - Goal 2: stop one thread at a time.
  33. The Sliding Views "Framework"
     - Develop a concurrent algorithm in which there is a short time when all the threads are stopped simultaneously to perform some task.
     - Then avoid stopping the threads together; instead, stop one thread at a time.
     - The tricky part: "fix" the problems created by this modification.
     - The idea is borrowed from the distributed computing community [Lamport].
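     One way the per-thread handshake might look. This is a hedged sketch: the deck gives no code for this step, and the Thread type and iteration helpers are hypothetical.

      /* Instead of stopping all mutators at once, visit them one at a time.
       * Each thread is paused briefly, its roots and buffers are taken, and
       * it resumes before the next thread is visited, so the collector's
       * "view" of the heap slides over an interval rather than being a
       * single snapshot.                                                    */
      static void sliding_view_handshake(void) {
          for (Thread *t = first_thread(); t != NULL; t = next_thread(t)) {
              suspend_thread(t);          /* only this one thread is paused */
              scan_thread_roots(t);
              take_thread_buffer(t);
              clear_thread_dirty_marks(t);
              resume_thread(t);
          }
      }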
  34. Graphically
      [Figure: a snapshot reads the whole heap at a single time t, whereas a sliding view reads different heap addresses over an interval from t1 to t2.]
  35. Fixing Correctness
     - The way to do this in our algorithm is to use snooping:
       - While collecting the roots, record objects that get a new pointer.
       - Do not reclaim these objects.
     - (No further details here.)
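     A hedged sketch of how snooping might be folded into the write barrier. snoop_active and mark_snooped are illustrative names; the deck deliberately omits the details.

      static volatile int snoop_active = 0;   /* set by the collector while the
                                                 sliding view is being taken   */

      static void snooping_update(Object **slot, Object *new_val) {
          Object *old = *slot;
          if (!IsDirty(slot)) {                /* the barrier from slide 27 */
              log(slot, old);
              SetDirty(slot);
          }
          if (snoop_active && new_val != NULL)
              mark_snooped(new_val);           /* hypothetical: conservatively
                                                  keep alive in this cycle     */
          *slot = new_val;
      }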
  36. Cycle Collection
     - Our initial solution: run a tracing algorithm infrequently.
     - More about this tracing collector and about cycle collectors later...
  37. Performance Measurements
     - Implementation for Java on the Jikes Research JVM.
     - Compared collectors:
       - Jikes parallel stop-the-world (STW).
       - Jikes concurrent RC (Jikes concurrent).
     - Benchmarks:
       - SPECjbb2000: a server benchmark that simulates business-like transactions.
       - SPECjvm98: a client benchmark suite of mostly single-threaded programs.
  38. Pause Times vs. STW
  39. Pause Times vs. Jikes Concurrent
  40. SPECjbb2000 Throughput
  41. SPECjvm98 Throughput
  42. SPECjbb2000 Throughput
  43. A Glimpse into Subsequent Work: SPECjbb2000 Throughput
  44. Subsequent Work
     - Cycle collection [CC'05]
     - A mark-and-sweep collector [OOPSLA'03]
     - A generational collector [CC'03]
     - An age-oriented collector [CC'05]
  45. Related Work
     - It's not clear where to start: RC, concurrent, generational, etc.
     - Some of the more relevant work was mentioned along the way.
  46. Conclusions
     - A study of concurrent garbage collection with a focus on RC.
     - Novel techniques obtaining short pauses and high efficiency.
     - The best approach: age-oriented collection with concurrent RC for the old generation and concurrent tracing for the young generation.
     - Implementation and measurements on Jikes demonstrate non-obtrusiveness and high efficiency.
  47. Project Building Blocks
     - A novel reference-counting algorithm.
     - State-of-the-art cycle collection.
     - Generational collection: RC for the old generation and tracing for the young.
     - A concurrent tracing collector.
     - An age-oriented collector: fitting generations with concurrent collectors.
