
Parallel worlds of CRuby's GC

I gave this presentation at RubyConf 2011. Yay!


Transcript of "Parallel worlds of CRuby's GC"

  1. 1. Parallel worlds of CRuby's GC. nari / Narihiro Nakamura / @nari_en, Network Applied Communication Laboratory Ltd. (Parallel worlds of CRuby's GC, powered by Rabbit 0.9.3)
  2. 2. I'm very happy now.
  3. 3. Today is my first presentation in English.
  4. 4. My English is not good.
  5. 5. But I'll do my best. Please bear with me :)
  6. 6. Self introduction
  7. 7. Ice-cream factory ✓ I worked on an assembly line. ✓ For example, I made many cardboard boxes. ✓ I was a professional cardboard box maker :)
  8. 8. Ice-cream factory ✓ I made 150 boxes per hour (ZOMG)
  9. 9. I was like a machine!! http://www.flickr.com/photos/kevincollins123/5887984753/
  10. 10. Working with Java ✓ I worked in a big company. ✓ This work was similar to assembly line work. ✓ I made one part of a product. I didn't understand the whole product.
  11. 11. I was still like a machine!! http://www.flickr.com/photos/kevincollins123/5887984753/
  12. 12. My current work ✓ Currently, I work at NaCl. ✓ matz, shyouhei and takaokouji are my co-workers. ✓ shugo is my boss. ✓ They are CRuby committers.
  13. 13. When I started Ruby programming ✓ I felt free. ✓ This work wasn't similar to assembly line work. ✓ I could make the whole product.
  14. 14. I was no longer a machine!! http://www.flickr.com/photos/danzden/121379782/
  15. 15. Garbage Collection for me ✓ GC technology is very interesting to me. ✓ GC is a garbage collecting machine. ✓ I've been creating it since then. It's a lot of fun!!
  16. 16. I'm making a machine!!
  17. 17. My relationship to GC
  18. 18. I'm a CRuby Committer ✓ I work on GC.
  19. 19. And I wrote a book about GC.
  20. 20. But it's only in Japanese :(
  21. 21. And I've been creating GC with RDD.
  22. 22. What is RDD?
  23. 23. RDD = RubyKaigi Driven Development
  24. 24. My RDD history ✓ LazySweepGC - RubyKaigi2008 ✓ LonglifeGC - 2009 ✓ LazySweepGC - 2010 ✓ ParallelMarkingGC - 2011
  25. 25. My RDD history ✓ LazySweepGC - RubyKaigi2008 ✓ LonglifeGC - 2009 ✓ LazySweepGC - 2010 ✓ ParallelMarkingGC - 2011
  26. 26. LonglifeGC ✓ It treats long-life objects as a special case. ✓ Similar to Generational GC. ✓ LonglifeGC was rejected for CRuby 1.9.2 for some reason. ✓ :(
  27. 27. But LonglifeGC has been used in Kiji :-) http://www.flickr.com/photos/conifer/2389654222/
  28. 28. Kiji ✓ Kiji is an optimized version of REE by Twitter developers. ✓ The Twitter team substantially extended LonglifeGC. ✓ It's cool!!
  29. 29. But Kiji will also be rejected... :(
  30. 30. My RDD history ✓ LazySweepGC - RubyKaigi2008 ✓ LonglifeGC - 2009 ✓ LazySweepGC - 2010 ✓ ParallelMarkingGC - 2011
  31. 31. LazySweepGC ✓ Traditional M&S GC executes mark and sweep atomically. ✓ The Ruby application stops during GC (stop-the-world). ✓ In lazy sweeping, sweeping is lazy.
  32. 32. LazySweepGC ✓ Each object allocation sweeps Ruby's heap ✓ until it finds an appropriate free object.
  33. 33. Improvements ✓ This improves the response time of GC. ✓ I.e. the worst-case GC time decreases.
  34. 34. LazySweepGC ✓ LazySweepGC is available since Ruby 1.9.3.
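
The lazy sweeping idea from the LazySweepGC slides can be sketched roughly as below. This is a toy model, not CRuby's implementation: the class name `LazySweepHeap`, its `allocate` method, and the representation of the heap as an array of mark bits are all assumptions made for illustration.

```ruby
# Toy model of lazy sweeping; not CRuby's actual implementation.
# The heap is an Array of mark bits: true = the slot survived the mark phase.
class LazySweepHeap
  attr_reader :slots_swept

  def initialize(marks)
    @marks = marks.dup
    @cursor = 0       # where the next sweep step resumes
    @slots_swept = 0  # how many slots have been examined so far
  end

  # Sweep only until one free slot is found, then stop.
  # Returns the index of the reused slot, or nil if no slot is left.
  def allocate
    while @cursor < @marks.size
      index = @cursor
      @cursor += 1
      @slots_swept += 1
      return index unless @marks[index] # unmarked = dead: reuse this slot
      @marks[index] = false             # live: clear the mark for the next cycle
    end
    nil # a real collector would start a new GC cycle or grow the heap here
  end
end

heap = LazySweepHeap.new([true, false, true, true, false])
first = heap.allocate   # examines slots 0 and 1, reuses slot 1
second = heap.allocate  # examines slots 2, 3 and 4, reuses slot 4
```

Each allocation pays only for the slots it has to examine, so the sweep cost is spread over many allocations instead of being one long pause.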
  35. 35. My RDD history ✓ LazySweepGC - RubyKaigi2008 ✓ LonglifeGC - 2009 ✓ LazySweepGC - 2010 ✓ ParallelMarkingGC - 2011
  36. 36. Today's topics
  37. 37. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  38. 38. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  39. 39. Why do we need Parallel Marking?
  40. 40. This is CRuby's current GC.
  41. 41. CRuby's current GC ✓ GC operates on only 1 core. ✓ In a multi-core environment, the other cores don't help GC.
  42. 42. GC: "I'm alone, it's so hard." http://www.flickr.com/photos/hortont/2698261070/
  43. 43. We should run GC in parallel!! http://www.flickr.com/photos/knallaerbse/2863161933/
  44. 44. First, let me explain a few GC-related concepts.
  45. 45. What is GC? ✓ GC collects all dead objects.
  46. 46. What is a dead object? ✓ A dead object is an object that is never referenced by the program. ✓ In GC terms, we say that a dead object is unreachable from Roots.
  47. 47. What is Roots? ✓ Roots is a set of pointers that directly reference objects in the program. ✓ e.g. Ruby's local variables, etc.
  48. 48. For example
  49. 49. Please remember that ✓ GC collects objects that are unreachable from Roots.
  50. 50. Next, let me explain the current CRuby GC algorithm.
  51. 51. CRuby's GC algorithm summary ✓ CRuby adopts the Mark & Sweep algorithm. ✓ The collector works in separate Mark and Sweep phases.
  52. 52. In the Mark phase ✓ the collector marks live objects that are reachable from Roots.
  53. 53. For example
  54. 54. Mark phase with GC.start
  55. 55. Ruby Heap after marking
  56. 56. In the Sweep phase ✓ the collector sweeps "dead" objects ✓ "dead" means unmarked ✓ "dead" means unreachable from Roots
  57. 57. Sweep phase
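
The Mark and Sweep phases described above can be illustrated with a small sketch. It models the heap as a Hash from object names to the names they reference; the object names, the `mark` and `sweep` helpers, and the single root are made up for illustration and say nothing about CRuby's real heap layout.

```ruby
# Toy mark & sweep over a graph of named objects (not CRuby's real heap).
# heap: object name => names of the objects it references
heap = {
  a: [:b, :c],
  b: [],
  c: [:d],
  d: [],
  e: [:f], # e references f, but nothing reachable from Roots references e
  f: []
}
roots = [:a]

# Mark phase: mark every object reachable from Roots.
def mark(heap, roots)
  marked = {}
  stack = roots.dup
  until stack.empty?
    obj = stack.pop
    next if marked[obj]
    marked[obj] = true
    stack.concat(heap[obj])
  end
  marked
end

# Sweep phase: every unmarked object is dead and gets collected.
def sweep(heap, marked)
  dead = heap.keys.reject { |obj| marked[obj] }
  dead.each { |obj| heap.delete(obj) }
  dead
end

marked = mark(heap, roots)
collected = sweep(heap, marked)
```

After the run, `collected` holds `:e` and `:f`, the two objects that are unreachable from Roots, and the heap keeps `:a` through `:d`.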
  58. 58. Characteristics of CRuby's GC
  59. 59. Characteristics ✓ The stop-the-world algorithm ✓ Single thread execution
  60. 60. Recently, PCs have multi-core processors. But, ✓ GC executes on a single thread. ✓ Other cores don't work during GC. ✓ What a waste!!
  61. 61. How can we fix this?
  62. 62. Use Parallel Marking, Luke
  63. 63. What is Parallel Marking?
  64. 64. What is Parallel Marking? ✓ The collector runs several marking processes in parallel ✓ by using native threads. ✓ We will be happy on multi-core machines.
  65. 65. Flow diagram for Parallel Marking
  66. 66. BTW: Why not perform sweeping in parallel?
  67. 67. Why not perform sweeping in parallel? ✓ Sweeping is much faster than marking. ✓ See ko1's research: ✓ <URL:http://www.atdot.net/~ko1/diary/201011.html#d4>
  68. 68. Why not perform sweeping in parallel? ✓ So, Mark phase improvement = GC improvement ✓ And we already have lazy sweeping.
  69. 69. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  70. 70. What to consider when implementing Parallel Marking?
  71. 71. We should consider two problems ✓ Workload balancing ✓ Wait-free algorithm
  72. 72. Workload balancing
  73. 73. How can we divide the marking task into sub-tasks?
  74. 74. I tried to think of a simple approach.
  75. 75. 1 branch of Roots is marked by 1 thread.
  76. 76. This means.. ✓ Tasks are distributed to multiple threads. ✓ The task of marking the entire heap is divided into several tasks, each marking a single branch.
  77. 77. This seems to be no problem.
  78. 78. But actually, this solution suffers from the workload problem.
  79. 79. Each thread doesn't know what the other threads are doing.
  80. 80. For instance, if A and B finish work early,
  81. 81. then they will stop doing anything :(
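
To make the workload problem concrete, here is a small calculation. The branch sizes are invented for illustration, and it assumes that marking one object costs one time unit; neither comes from the slides.

```ruby
# Hypothetical branch sizes (objects to mark); one branch per thread, no stealing.
branch_sizes = { "A" => 10, "B" => 20, "C" => 200 }

# Assume marking one object costs one time unit. The threads run in parallel,
# so the mark phase lasts as long as the slowest thread.
mark_phase_length = branch_sizes.values.max

# Time each thread sits idle after finishing its own branch.
idle_time = branch_sizes.map { |name, size| [name, mark_phase_length - size] }.to_h

# With perfectly balanced work, each thread would mark about this many objects.
total_work = branch_sizes.values.sum
ideal_length = (total_work.to_f / branch_sizes.size).ceil
```

With these numbers, threads A and B are idle for most of the mark phase, and the phase takes 200 time units where a balanced split would need about 77.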
  82. 82. I think "machines should work forever" :D
  83. 83. So, I think A and B should ...
  84. 84. http://www.flickr.com/photos/ryanr/157458385/
  85. 85. Parallel Marking with Task Stealing.
  86. 86. If A and B finish work early,
  87. 87. This is called "Task Stealing"
  88. 88. We should consider two problems ✓ Workload balancing ✓ Wait-free algorithm
  89. 89. Wait-free algorithm
  90. 90. What does "wait-free" mean? ✓ A wait-free program does non-blocking execution. ✓ It guarantees per-thread progress.
  91. 91. Why is wait-free important?
  92. 92. Amdahl's law
  93. 93. Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. [cited from "Amdahl's law", Wikipedia]
  94. 94. Amdahl's law is used in parallel computing ✓ If the parallel portion of the system is X% ✓ and the number of processors is Y, ✓ how much speedup can we expect?
  95. 95. It's worse than expected, right?
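
The question on the Amdahl's law slide can be answered with the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction and n the number of processors. The sketch below evaluates it; the method name `amdahl_speedup` and the sample percentages are chosen for illustration, since the values shown on the original slide are not in this transcript.

```ruby
# Amdahl's law: speedup = 1 / ((1 - p) + p / n)
#   p = parallel fraction of the work (0.0..1.0), n = number of processors
def amdahl_speedup(parallel_fraction, processors)
  serial_fraction = 1.0 - parallel_fraction
  1.0 / (serial_fraction + parallel_fraction.to_f / processors)
end

# Illustrative values only.
[50, 80, 95].each do |percent|
  [2, 4, 8].each do |processors|
    speedup = amdahl_speedup(percent / 100.0, processors)
    puts format("%d%% parallel, %d processors: %.2fx", percent, processors, speedup)
  end
end
```

Even when 80% of the work is parallel, 8 processors give only about a 3.3x speedup, because the serial 20% dominates. That is why the non-parallel parts matter so much.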
  96. 96. The conclusion so far
  97. 97. The conclusion so far ✓ We should consider how we can efficiently balance workloads. ✓ So, we use Task Stealing. ✓ We should eliminate non-parallel parts ✓ by using a wait-free algorithm.
  98. 98. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  99. 99. How to implement Parallel Marking?
  100. 100. Task Stealing ✓ In Task Stealing, threads steal tasks from each other. ✓ Task Stealing is achieved with Arora's Deque.
  101. 101. Arora's Deque ✓ Deque stands for Double-Ended Queue. ✓ In Arora's Deque, the deque contains tasks as elements. ✓ It's a wait-free data structure.
  102. 102. Arora's Deque has only three operations.
  103. 103. Each mark worker has a single deque.
  104. 104. Only the owner can call pop() and push().
  105. 105. A worker can call shift() to steal from another worker's deque.
  106. 106. "Hey wait a minute, doesn't shift() have contention problems?"
  107. 107. In what ways could shift() cause contention problems? e.g... ✓ Multiple threads (workers) may call shift() on the same deque at the same time.
  108. 108. In what ways could shift() cause contention problems? e.g... ✓ shift() and pop() could be called at the same time ✓ when the deque has only one element.
  109. 109. But Arora's Deque avoids these contention problems.
  110. 110. Serialization ✓ shift() is serialized by using CAS. ✓ CAS = Compare And Swap ✓ And this serialization doesn't use a lock. ✓ It's wait-free!!
  111. 111. I omit the details of the implementation of the serialization.
  112. 112. For the sake of this presentation, let's assume that Arora's Deque avoids contention problems.
  113. 113. Summary for Arora's Deque ✓ A simple data structure for Task Stealing. ✓ Each worker has a single deque. ✓ Stealing (shift operation) is wait-free!
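
The shape of the three operations can be sketched as follows. This only shows who uses which end of the deque. A plain Ruby Array stands in for the real structure, so the sketch is not thread-safe and does not show the CAS-based serialization at all; the class name `WorkerDeque` is hypothetical.

```ruby
# Shape of the three operations only. A plain Array stands in for the deque,
# so this is NOT thread-safe and omits the CAS-based serialization.
class WorkerDeque
  def initialize
    @tasks = []
  end

  # Owner only: add a task at the owner's end.
  def push(task)
    @tasks.push(task)
    self
  end

  # Owner only: take the most recently pushed task.
  def pop
    @tasks.pop
  end

  # Other workers: steal the oldest task from the opposite end.
  def shift
    @tasks.shift
  end

  def empty?
    @tasks.empty?
  end
end

deque = WorkerDeque.new
deque.push(:b).push(:c)
stolen = deque.shift # another worker steals :b
own = deque.pop      # the owner keeps working on :c
```

Because the owner and the thieves work on opposite ends, they only meet when the deque is down to its last element, which is the contention case mentioned above.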
  114. 114. How to use Arora's Deque in Parallel Marking?
  115. 115. First try: A task is an object.
  116. 116. Let's say that worker A has a branch that is composed of 4 objects.
  117. 117. We start by marking A and pushing it to the deque.
  118. 118. pop A, mark B and C, push B and C.
  119. 119. pop C, mark D, push D
  120. 120. pop D, pop B
  121. 121. This is a branch marking.
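
The walkthrough above can be replayed in a few lines. The sketch assumes the branch looks like this: A references B and C, and C references D (the original diagram is not in the transcript, so this shape is inferred from the pop and push order). A plain Array plays the role of the deque, used as a marking stack by a single worker.

```ruby
# Assumed branch shape: A references B and C, C references D.
children = { "A" => ["B", "C"], "B" => [], "C" => ["D"], "D" => [] }

marked = {}
deque = []      # the owner's end is the tail of the Array
pop_order = []  # records the order in which objects are popped

# Start by marking the first object of the branch and pushing it.
marked["A"] = true
deque.push("A")

until deque.empty?
  obj = deque.pop
  pop_order << obj
  children[obj].each do |child|
    next if marked[child] # objects that are already marked are not pushed again
    marked[child] = true
    deque.push(child)
  end
end
```

The objects are popped in the order A, C, D, B, which matches the sequence on the slides.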
  122. 122. How do you steal?
  123. 123. Suppose that worker1 has tasks B and C. Worker2 has no task.
  124. 124. Worker2 steals task B from Worker1 by using shift().
  125. 125. Summary ✓ The marker uses Arora's Deque as a marking stack. ✓ A "task" means an object. ✓ The granularity of the task is very fine. ✓ This is a naive implementation.
  126. 126. I implemented this approach.
  127. 127. But..
  128. 128. It's slower than the original GC.
  129. 129. OMG... http://www.flickr.com/photos/emariephotos/4958245676/
  130. 130. I fell into the Pitfalls of Parallel Processing (PPP!!!)
  131. 131. Why slow?
  132. 132. Why slow? ✓ pop(), push(), shift() are called frequently. ✓ Because the deque has fine-grained tasks. ✓ Their overhead is too big.
  133. 133. How to fix this?
  134. 134. We can make the tasks less fine-grained.
  135. 135. A task is a branch
  136. 136. All branches in Roots are divided roughly among the deques.
  137. 137. Each Worker marks a branch in its deque.
  138. 138. When the deque is empty, the worker steals a branch from another worker.
  139. 139. like this!!
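
The coarse-grained scheme can be simulated in a single thread as below. The branch names, the worker count, the round-robin distribution, and the `next_task` helper are all assumptions for illustration; the real workers are native threads and the slides only say that branches are divided "roughly".

```ruby
# Single-threaded simulation of coarse-grained tasks.
# Branch names and the worker count are made up.
branches = %w[r1 r2 r3 r4 r5 r6 r7]
worker_count = 2

# Divide the branches in Roots roughly among the deques (round-robin here).
deques = Array.new(worker_count) { [] }
branches.each_with_index { |branch, i| deques[i % worker_count].push(branch) }

# Next task for `worker`: its own deque first, otherwise steal from another.
# Returns [branch, stolen?] or nil when no work is left anywhere.
def next_task(deques, worker)
  own = deques[worker].pop
  return [own, false] if own
  victim = deques.find { |d| !d.empty? }
  victim ? [victim.shift, true] : nil
end

# Suppose worker 1 is fast and asks for work four times in a row.
log = Array.new(4) { next_task(deques, 1) }
```

Worker 1 first drains its own deque (r6, r4, r2) and then steals r1 from worker 0, so it does not sit idle while worker 0 still has branches queued.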
  140. 140. Good point & Bad point ✓ The number of calls to the Deque's operations was reduced. ✓ The marking speed of the worker is improved. ✓ However, coarse-grained tasks decrease parallelism.
  141. 141. Why do coarse-grained tasks decrease parallelism?
  142. 142. Tasks may involve a large branch.
  143. 143. If an object in B's branch has many child objects..
  144. 144. .. then A can't steal it while B is marking the large branch.
  145. 145. So, the worker needs to treat large branches as special cases.
  146. 146. Almost all large branches hold large Array objects and/or large Hash objects.
  147. 147. Treatment for large Array objects and Hash objects ✓ Each marker has a special deque to manage them. ✓ A marker divides them into fixed-size tasks. ✓ e.g. elements 0-9 of the Array, elements 10-19 of the Array...
  148. 148. Treatment for large Array and Hash ✓ By doing this, other workers can steal divided tasks. ✓ This improves parallelism.
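
Dividing a large Array into fixed-size tasks might look like the sketch below. The chunk size of 10 simply follows the "0-9, 10-19" example; the size used by the real implementation is not stated, and `split_into_tasks` is a hypothetical helper.

```ruby
# Split the element indices of a large Array into fixed-size marking tasks.
# chunk_size = 10 follows the "0-9, 10-19" example; the real size is not given.
def split_into_tasks(length, chunk_size = 10)
  (0...length).step(chunk_size).map do |first|
    last = [first + chunk_size, length].min - 1
    (first..last)
  end
end

tasks = split_into_tasks(25)
```

For a 25-element Array this yields the index ranges 0..9, 10..19 and 20..24. Each range is a separate task, so an idle worker can steal one of them while the owner marks another.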
  149. 149. Summary ✓ The naive implementation was slow. ✓ The grain of the task was too fine. ✓ A "task" means a branch in Roots ✓ The grain of the task is coarse. ✓ It's faster!!
  150. 150. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  151. 151. How much did performance improve?
  152. 152. These are my machine specs ✓ My machine has only 2 cores ✓ Memory: 8GB ✓ OS: Linux
  153. 153. Parallel marking uses 4 marking threads.
  154. 154. First benchmark program is ✓ make benchmark ✓ This is the benchmark which is used in CRuby development.
  155. 155. Why does this seem so slow? ✓ I think it's affected by Parallel Marking's preparation. ✓ e.g. creating marking threads, allocation of deques.
  156. 156. Why does this seem so slow? ✓ In most of the benchmarks, the mark target objects are few. ✓ In this case, the cost of Parallel Marking is high.
  157. 157. Next benchmark program is ✓ make rdoc ✓ make rdoc generates the Ruby documentation. ✓ This benchmark measures the total execution time and the GC execution time of make rdoc.
  158. 158. make rdoc ✓ It takes about 80 seconds on my machine. ✓ In fact, 30% of that time is spent on GC!! ✓ How much did performance improve?
  159. 159. Total GC time improved by 40%!
  160. 160. So fast!!
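
A quick back-of-the-envelope calculation shows what these figures would mean for the whole run. It assumes that "improved by 40%" means the total GC time shrinks by 40% and that the non-GC time is unchanged; the slides do not spell this out.

```ruby
# Figures from the slides: about 80 s in total, 30% of it spent in GC.
total_seconds = 80.0
gc_share = 0.30
# Assumption: "improved by 40%" means total GC time shrinks by 40%.
gc_reduction = 0.40

gc_before = total_seconds * gc_share              # seconds spent in GC before
gc_after = gc_before * (1.0 - gc_reduction)       # seconds spent in GC after
total_after = total_seconds - gc_before + gc_after
overall_gain = (total_seconds - total_after) / total_seconds
```

Under that reading, GC time drops from roughly 24 s to roughly 14.4 s, and the whole `make rdoc` run from about 80 s to about 70.4 s, an overall gain of around 12%.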
  161. 161. In a many-core environment ✓ I expect we get a large improvement. ✓ e.g. 8 cores, 16 cores... ✓ But my machine has just 2 cores. ✓ I can't see it :(
  162. 162. Best case for Parallel GC ✓ If there are many objects. ✓ In this case, there are also many mark targets. ✓ If the objects are long-lived. ✓ Server-side applications?
  163. 163. Demo
  164. 164. Demonstration ✓ I want to show the performance improvement with Parallel GC. ✓ This demonstration is in video game style.
  165. 165. Let me explain this game.
  166. 166. And the character has HP.
  167. 167. When GC runs,
  168. 168. the character loses HP while waiting for the GC to finish.
  169. 169. We must reach the goal before HP runs out.
  170. 170. Other characteristics of SUPER NARIO GC ✓ GC runs at fixed intervals. ✓ A lot of objects are generated to increase GC's burden. ✓ Burden = Game Level
  171. 171. Try to compare Original GC and Parallel GC ✓ Original GC pause time is long. ✓ This game will be difficult. ✓ Parallel GC pause time is short. ✓ This game will be easy.
  172. 172. OK, let's try!
  173. 173. DEMO: Original GC version
  174. 174. Oops.. so difficult!!!
  175. 175. DEMO: Parallel GC version
  176. 176. Wow!! Easy!!!!
  177. 177. Let's compare average GC times
  178. 178. Fast!!
  179. 179. Remaining Problems
  180. 180. Windows OS is not supported ✓ The Mark Worker uses pthread as its native thread. ✓ And it uses some gcc built-in functions. ✓ But I'll support Windows eventually.
  181. 181. Increased memory usage. ✓ The size of 1 Deque is roughly 32KB. ✓ But generally multi-core machines have plenty of memory. ✓ So, I think it's OK :P
  182. 182. Conclusion
  183. 183. Conclusion ✓ I implemented Parallel Marking GC ✓ GC was improved! ✓ I'll report to ruby-core soon.
  184. 184. Conclusion ✓ But Parallel Marking has some problems. ✓ I'll fix these.
  185. 185. source code ✓ Parallel Marking GC ✓ <URL:https://github.com/authorNari/ruby/tree/pmark_div_root2> ✓ SUPER NARIO GC ✓ <URL:https://github.com/authorNari/nario/>
  186. 186. Acknowledgments ✓ The following people helped me make this presentation!! ✓ Tor-san!! ✓ matz, shugo, yhara, sada, takaokouji, other co-workers!!
  187. 187. Thank you!!!
  188. 188. Do you have any questions? Please ask short and simple questions :)
  189. 189. Sorry ✓ It's too difficult for me to understand/answer the question. ✓ Could you send the question on Twitter (@nari_en)?