Parallel worlds of CRuby's GC


  1. Parallel worlds of CRuby's GC nari/Narihiro Nakamura/ @nari_en Network Applied Communication Laboratory Ltd. Parallel worlds of CRuby's GC Powered by Rabbit 0.9.3
  2. I'm very happy now.
  3. Today is my first presentation in English.
  4. My English is not good.
  5. But, I'll do my best. Please bear with me :)
  6. Self introduction
  7. Ice-cream factory ✓ I worked on an assembly line. ✓ For example, I made many cardboard boxes. ✓ I was a professional cardboard box maker :)
  8. Ice-cream factory ✓ I made 150 boxes per hour (ZOMG)
  9. I was like a machine!! http://www.flickr.com/photos/kevincollins123/5887984753/
  10. Working with Java ✓ I worked in a big company. ✓ This work was similar to assembly line work. ✓ I made a part of a product. I didn't understand the whole product.
  11. I was still like a machine!! http://www.flickr.com/photos/kevincollins123/5887984753/
  12. My current work ✓ Currently, I work at NaCl. ✓ matz, shyouhei, and takaokouji are my co-workers. ✓ shugo is my boss. ✓ They are CRuby committers.
  13. When I started Ruby programming ✓ I felt free. ✓ This work wasn't similar to assembly line work. ✓ I could make the whole product.
  14. I was no longer a machine!! http://www.flickr.com/photos/danzden/121379782/
  15. Garbage Collection for me ✓ GC technology is very interesting to me. ✓ GC is a garbage collecting machine. ✓ I've been creating it since then. It's very fun!!
  16. I'm making a machine!!
  17. My relationship to GC
  18. I'm a CRuby Committer ✓ I work on GC.
  19. And, I wrote a book about GC.
  20. But, it's only in Japanese :(
  21. And, I've been creating GC with RDD.
  22. What is RDD?
  23. RDD = RubyKaigi Driven Development
  24. My RDD history ✓ LazySweepGC - RubyKaigi2008 ✓ LonglifeGC - 2009 ✓ LazySweepGC - 2010 ✓ ParallelMarkingGC - 2011
  25. My RDD history ✓ LazySweepGC - RubyKaigi2008 ✓ LonglifeGC - 2009 ✓ LazySweepGC - 2010 ✓ ParallelMarkingGC - 2011
  26. LonglifeGC ✓ It treats long-life objects as a special case. ✓ It is similar to Generational GC. ✓ LonglifeGC was rejected in CRuby 1.9.2 for some reason. ✓ :'(
  27. But, LonglifeGC has been used in Kiji :-) http://www.flickr.com/photos/conifer/2389654222/
  28. Kiji ✓ Kiji is an optimized version of REE by Twitter developers. ✓ The Twitter team substantially extended LonglifeGC. ✓ It's cool!!
  29. But Kiji will also be rejected... :'(
  30. My RDD history ✓ LazySweepGC - RubyKaigi2008 ✓ LonglifeGC - 2009 ✓ LazySweepGC - 2010 ✓ ParallelMarkingGC - 2011
  31. LazySweepGC ✓ A traditional M&S GC executes mark and sweep atomically. ✓ The Ruby application stops during GC (stop-the-world). ✓ In lazy sweeping, the sweep phase is not run all at once.
  32. LazySweepGC ✓ Each object allocation sweeps Ruby's heap ✓ until it finds an appropriate free object.
  33. Improvements ✓ This improves the response time of GC ✓ i.e. the worst-case time of GC decreases.
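A minimal sketch of the lazy sweeping idea described above, not CRuby's actual implementation. The class name `LazySweepHeap`, the `marks` list, and the `cursor` field are hypothetical and exist only for illustration. Each allocation advances the sweep just far enough to find one dead slot, so the sweep cost is spread over many allocations instead of being paid in one pause.

```python
class LazySweepHeap:
    """Toy heap: each slot is either marked (live) or unmarked (dead)."""

    def __init__(self, marks):
        # marks[i] is True if the preceding mark phase found slot i reachable.
        self.marks = list(marks)
        self.cursor = 0  # next slot the lazy sweep will examine

    def allocate(self):
        """Sweep forward only until one dead slot is found, then stop."""
        while self.cursor < len(self.marks):
            slot = self.cursor
            self.cursor += 1
            if self.marks[slot]:
                self.marks[slot] = False  # live: clear the mark for the next cycle
            else:
                return slot  # dead: hand this slot to the new object
        return None  # heap fully swept; a real collector would mark again or grow


heap = LazySweepHeap([True, False, True, True, False])
first = heap.allocate()   # examines slots 0 and 1, reuses slot 1
second = heap.allocate()  # examines slots 2, 3 and 4, reuses slot 4
```

In this toy model no single allocation has to visit the whole heap, which is the sense in which the worst-case time goes down.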
  34. LazySweepGC ✓ LazySweepGC has been available since Ruby 1.9.3.
  35. My RDD history ✓ LazySweepGC - RubyKaigi2008 ✓ LonglifeGC - 2009 ✓ LazySweepGC - 2010 ✓ ParallelMarkingGC - 2011
  36. Today's topics
  37. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  38. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  39. Why do we need Parallel Marking?
  40. This is CRuby's current GC.
  41. Current CRuby's GC ✓ GC operates on only 1 core. ✓ In a multi-core environment, other cores don't help GC.
  42. GC:"I'm alone, it's so hard." http://www.flickr.com/photos/hortont/2698261070/
  43. We should run GC in parallel!! http://www.flickr.com/photos/knallaerbse/2863161933/
  44. First, let me explain a few GC-related concepts.
  45. What is GC? ✓ GC collects all dead objects.
  46. What is a dead object? ✓ A dead object is an object that the program can no longer reference. ✓ In GC terms, we say that a dead object is unreachable from Roots.
  47. What is Roots? ✓ Roots is a set of pointers that directly reference objects in the program. ✓ e.g. Ruby's local variables, etc.
  48. For example
  49. Please remember that ✓ GC collects objects that are unreachable from Roots.
  50. Next, let me explain the current CRuby GC algorithm.
  51. CRuby's GC algorithm summary ✓ CRuby adopts the Mark & Sweep algorithm. ✓ The collector works in separate Mark and Sweep phases.
  52. In the Mark phase ✓ the collector marks live objects that are reachable from Roots.
  53. For example
  54. Mark phase with GC.start
  55. Ruby Heap after marking
  56. In the Sweep phase ✓ the collector sweeps "dead" objects ✓ "dead" means unmarked ✓ "dead" means unreachable from Roots
  57. Sweep phase
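The two phases can be sketched in a few lines. This is an illustrative model only: the object names, the `references` dictionary, and the functions `mark` and `sweep` are assumptions made for the example, not CRuby internals. Note that object E points at a live object but is itself unreachable from Roots, so it is dead.

```python
def mark(roots, references):
    """Mark phase: return the set of objects reachable from the roots."""
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue  # already visited through another path
        marked.add(obj)
        stack.extend(references.get(obj, ()))
    return marked


def sweep(heap, marked):
    """Sweep phase: split the heap into survivors (marked) and dead (unmarked)."""
    survivors = [obj for obj in heap if obj in marked]
    dead = [obj for obj in heap if obj not in marked]
    return survivors, dead


heap = ["A", "B", "C", "D", "E"]
references = {"A": ["B", "C"], "C": ["D"], "E": ["A"]}
marked = mark(["A"], references)       # Roots contain only A
survivors, dead = sweep(heap, marked)
```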
  58. Characteristics of CRuby's GC
  59. Characteristics ✓ The stop-the-world algorithm ✓ Single thread execution
  60. Recent PCs have multi-core processors. But, ✓ GC executes on a single thread. ✓ Other cores don't work during GC. ✓ What a waste!!
  61. How can we fix this?
  62. Use Parallel Marking, Luke
  63. What is Parallel Marking?
  64. What is Parallel Marking? ✓ The collector runs several marking processes in parallel ✓ by using native threads. ✓ We will be happy on multi-core machines.
  65. Flow diagram for Parallel Marking
  66. BTW: Why not perform sweeping in parallel?
  67. Why not perform sweeping in parallel ✓ Sweeping is much faster than marking. ✓ You can see ko1's research ✓ <URL:http://www.atdot.net/~ko1/diary/201011.html#d4>
  68. Why not perform sweeping in parallel ✓ So, Mark phase improvement = GC improvement ✓ And, we already have lazy sweeping.
  69. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  70. What to consider when implementing Parallel Marking?
  71. We should consider two problems ✓ Workload balancing ✓ Wait-free algorithm
  72. Workload balancing
  73. How can we divide the marking task into sub-tasks?
  74. I tried to think of a simple approach.
  75. 1 branch of Roots is marked by 1 thread.
  76. This means.. ✓ Tasks are distributed to multiple threads. ✓ The task of marking the entire heap is divided into several tasks, each marking a single branch.
  77. This seems to be no problem.
  78. But actually, this solution suffers from the workload problem.
  79. Each thread doesn't know what the other threads are doing.
  80. For instance, if A and B finish work early,
  81. then, they will stop doing anything :(
  82. I think "machines should work forever" :D
  83. So, I think A and B should ...
  84. http://www.flickr.com/photos/ryanr/157458385/
  85. Parallel Marking with Task Stealing.
  86. If A and B finish work early,
  87. This is called "Task Stealing"
  88. We should consider two problems ✓ Workload balancing ✓ Wait-free algorithm
  89. Wait-free algorithm
  90. What does "wait-free" mean? ✓ A wait-free program does non-blocking execution. ✓ It guarantees per-thread progress.
  91. Why is wait-free important?
  92. Amdahl's law
  93. Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. [cited from `Amdahl's law - Wikipedia']
  94. Amdahl's law is used in parallel computing ✓ If the parallel portion of the system is X% ✓ And the number of processors is Y, ✓ How much speedup can we expect?
  95. It's worse than expected, right?
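For reference, Amdahl's law gives the speedup as 1 / ((1 - p) + p / n), where p is the parallel fraction (X% on the slide) and n is the number of processors (Y). The helper below is a small sketch for trying out values; the function name `amdahl_speedup` and the sample fraction of 0.9 are assumptions chosen for the example, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup when only part of the work runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Hypothetical workload in which 90% of the time can be parallelized.
for processors in (2, 4, 8, 16):
    print(processors, round(amdahl_speedup(0.9, processors), 2))
```

Even with 16 processors, this hypothetical 90% parallel workload speeds up by only about 6.4 times, because the remaining serial 10% dominates. That is why shrinking the non-parallel part matters so much.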
  96. The conclusion so far
  97. The conclusion so far ✓ We should consider how we can efficiently balance workloads. ✓ So, we use Task Stealing. ✓ We should eliminate non-parallel parts ✓ by using a wait-free algorithm.
  98. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  99. How to implement Parallel Marking?
  100. Task Stealing ✓ In Task Stealing, threads steal tasks from each other ✓ Task Stealing is achieved with Arora's Deque
  101. Arora's Deque ✓ Deque stands for Double-Ended Queue. ✓ In Arora's Deque, the deque contains tasks as elements. ✓ It's a wait-free data structure.
  102. Arora's Deque has only three operations.
  103. Each mark worker has a single deque.
  104. Only the owner can call pop() and push().
  105. A worker can call shift() to steal from another worker's deque.
  106. "Hey wait a minute, doesn't shift() have contention problems?"
  107. In what ways could shift() cause contention problems? e.g... ✓ Multiple threads (workers) may call shift() on the same deque at the same time.
  108. In what ways could shift() cause contention problems? e.g... ✓ shift() and pop() could be called at the same time ✓ when the deque has only one element.
  109. But, Arora's Deque avoids these contention problems.
  110. Serialization ✓ shift() is serialized by using CAS. ✓ CAS = Compare And Swap ✓ And, this serialization doesn't use a lock. ✓ It's wait-free!!
  111. I omit details of the implementation of the serialization.
  112. For the sake of this presentation, let's assume that Arora's Deque avoids contention problems.
  113. Summary for Arora's Deque ✓ A simple data structure for Task Stealing. ✓ Each worker has a single deque. ✓ Stealing (shift operation) is wait-free!
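Since the talk omits the serialization details, here is a rough sketch of how such a deque can be arranged, loosely following the published Arora, Blumofe and Plaxton design. It is not the code from the talk. Python has no hardware CAS, so the hypothetical `AtomicCell` below emulates one with a small lock; a native implementation would use a real compare-and-swap instruction instead. The names `StealableDeque`, `AtomicCell`, `age`, and `tag` are assumptions for illustration; `push()`, `pop()`, and `shift()` follow the slides.

```python
import threading


class AtomicCell:
    """Stand-in for a hardware CAS word; the lock only emulates atomicity."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def store(self, value):
        self._value = value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


class StealableDeque:
    def __init__(self, capacity=1024):
        self.tasks = [None] * capacity
        self.bottom = 0                 # owner's end
        self.age = AtomicCell((0, 0))   # (tag, top): the thieves' end

    def push(self, task):               # owner only
        if self.bottom >= len(self.tasks):
            raise OverflowError("deque is full")
        self.tasks[self.bottom] = task
        self.bottom += 1

    def pop(self):                      # owner only
        if self.bottom == 0:
            return None
        self.bottom -= 1
        task = self.tasks[self.bottom]
        old_age = self.age.load()
        tag, top = old_age
        if self.bottom > top:
            return task                 # other tasks remain, so no thief can race for this one
        # Last task: the deque ends up empty either way, so reset the indices.
        local_bottom = self.bottom
        self.bottom = 0
        new_age = (tag + 1, 0)          # bumping the tag invalidates stale thieves
        if local_bottom == top and self.age.compare_and_swap(old_age, new_age):
            return task                 # the owner won the race for the last task
        self.age.store(new_age)
        return None                     # a thief took the last task first

    def shift(self):                    # called by thieves
        old_age = self.age.load()
        tag, top = old_age
        if self.bottom <= top:
            return None                 # nothing to steal
        task = self.tasks[top]
        if self.age.compare_and_swap(old_age, (tag, top + 1)):
            return task
        return None                     # lost the race to another thief or the owner
```

The two contention cases from the slides map onto the two CAS calls: competing thieves are serialized by the CAS in `shift()`, and the single-element race between `shift()` and `pop()` is settled by the CAS in `pop()`.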
  114. How to use Arora's Deque in Parallel Marking?
  115. First try: A task is an object.
  116. Let's say that worker A has a branch that is composed of 4 objects.
  117. We start by marking A and pushing it to the deque.
  118. pop A, mark B and C, push B and C.
  119. pop C, mark D, push D
  120. pop D, pop B
  121. This is a branch marking.
  122. How do you steal?
  123. Suppose that Worker1 has tasks B and C. Worker2 has no task.
  124. Worker2 steals task B from Worker1 by using shift().
  125. Summary ✓ Marker uses Arora's Deque as a marking stack. ✓ A "task" means an object. ✓ The granularity of the task is very fine. ✓ This is a naive implementation.
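The walk-through above can be replayed with a small sketch. Here `collections.deque` stands in for Arora's Deque (`append` as push, `pop` as pop, `popleft` as shift); it is not wait-free, and the function name `mark_branch` and the `children` dictionary are assumptions for illustration.

```python
from collections import deque


def mark_branch(root, children):
    """Mark one branch, using the deque as a marking stack of objects."""
    marked = {root}
    work = deque([root])          # mark the root and push it
    pop_order = []
    while work:
        obj = work.pop()          # the owner pops from its own end
        pop_order.append(obj)
        for child in children.get(obj, ()):
            if child not in marked:
                marked.add(child)     # mark the child, then push it as a new task
                work.append(child)
    return marked, pop_order


# The 4-object branch from the slides: A references B and C, C references D.
children = {"A": ["B", "C"], "C": ["D"]}
marked, pop_order = mark_branch("A", children)

# Stealing: Worker1 holds tasks B and C, Worker2 holds nothing.
worker1 = deque(["B", "C"])
worker2 = deque()
stolen = worker1.popleft()        # shift(): take from the end the owner is not using
worker2.append(stolen)
```

Every object costs one push and one pop, which is the overhead that the following slides identify as the problem.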
  126. I implemented this approach.
  127. But..
  128. It's slower than original GC.
  129. OMG... http://www.flickr.com/photos/emariephotos/4958245676/
  130. I fell into the Pitfalls of Parallel Processing (PPP!!!)
  131. Why slow?
  132. Why slow? ✓ pop(), push(), shift() are called frequently. ✓ Because the deque has fine-grained tasks. ✓ Their overhead is too big.
  133. How to fix this?
  134. We can make the tasks less fine-grained.
  135. A task is a branch
  136. All branches in Roots are divided roughly among the deques.
  137. Each Worker marks a branch in its deque.
  138. When the deque is empty, the worker steals a branch from another worker.
  139. like this!!
  140. Good point & Bad point ✓ The number of calls to the Deque's operations was reduced. ✓ Marking speed of the worker is improved. ✓ However, coarse-grained tasks decrease parallelism.
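A toy simulation can make the trade-off concrete. It is a sketch under simplifying assumptions that are not from the talk: time advances in ticks, every worker marks one object per tick, a branch is represented only by its size, and branches are dealt to the deques round-robin. The function name `simulate_marking` is hypothetical.

```python
from collections import deque


def simulate_marking(branch_sizes, n_workers, stealing):
    """Return the number of ticks until every branch has been marked."""
    deques = [deque() for _ in range(n_workers)]
    for i, size in enumerate(branch_sizes):
        deques[i % n_workers].append(size)       # rough division of Roots
    remaining = [0] * n_workers                  # objects left in each worker's current branch
    ticks = 0
    while True:
        for w in range(n_workers):
            if remaining[w] == 0:
                if deques[w]:
                    remaining[w] = deques[w].pop()          # take from own deque
                elif stealing:
                    for victim in deques:
                        if victim:
                            remaining[w] = victim.popleft()  # steal a whole branch
                            break
        if not any(remaining):
            return ticks
        for w in range(n_workers):
            if remaining[w] > 0:
                remaining[w] -= 1                # mark one object
        ticks += 1


without_stealing = simulate_marking([4, 1, 4, 1], 2, stealing=False)
with_stealing = simulate_marking([4, 1, 4, 1], 2, stealing=True)
one_large_branch = simulate_marking([10, 1], 2, stealing=True)
```

In this model, stealing shortens the unbalanced case from 8 ticks to 6. The last case still takes 10 ticks even with stealing enabled, because a branch that its owner has already started marking is no longer in any deque and cannot be shared.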
  141. Why do coarse-grained tasks decrease parallelism?
  142. Tasks may involve a large branch.
  143. If an object in B's branch has many child objects..
  144. .. then A can't steal it while B is marking the large branch.
  145. So, the worker needs to treat large branches as special cases.
  146. Almost all large branches hold large Array objects and/or large Hash objects.
  147. Treatment for large Array objects and Hash objects ✓ Each marker has a special deque to manage them. ✓ A marker divides them into fixed size tasks. ✓ e.g. 0-9 elements of Array, 10-19 elements of Array...
  148. Treatment for Large Array and Hash ✓ By doing this, other workers can steal divided tasks. ✓ This improves parallelism.
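Dividing a large Array into fixed-size tasks might look like the sketch below. The chunk size of 10 matches the example on the slide, but the real value, the function name `split_into_tasks`, and the half-open `(start, stop)` representation are assumptions.

```python
CHUNK_SIZE = 10  # assumed from the "0-9, 10-19" example; the real size may differ


def split_into_tasks(length, chunk_size=CHUNK_SIZE):
    """Split indices 0..length-1 into half-open (start, stop) ranges."""
    return [(start, min(start + chunk_size, length))
            for start in range(0, length, chunk_size)]


tasks = split_into_tasks(25)  # elements 0-9, 10-19 and 20-24
```

Each range can then be pushed onto the special deque as its own task, so an idle worker can steal one range while the owner marks another.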
  149. Summary ✓ The naive implementation was slow. ✓ The grain of the task was too fine. ✓ Now a "task" means a branch in Roots. ✓ The grain of the task is coarse. ✓ It's faster!!
  150. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  151. How much did performance improve?
  152. These are my machine specs ✓ My machine has only 2 cores ✓ Memory: 8GB ✓ OS: Linux
  153. Parallel marking uses 4 marking threads.
  154. First benchmark program is ✓ make benchmark ✓ This is the benchmark which is used in CRuby development
  155. Why does this seem so slow? ✓ I think it's affected by Parallel Marking's preparation. ✓ e.g. creating marking threads, allocating deques.
  156. Why does this seem so slow? ✓ In most of the benchmarks, there are few objects to mark. ✓ In this case, Parallel Marking is relatively expensive.
  157. Next benchmark program is ✓ make rdoc ✓ make rdoc generates the Ruby documentation. ✓ This benchmark measures the total execution time and the GC execution time of make rdoc.
  158. make rdoc ✓ It takes about 80 seconds on my machine. ✓ In fact, 30% of that time is spent on GC!! ✓ How much did performance improve?
  159. All GC time is improved by 40%!
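As a rough back-of-the-envelope check, the numbers above can be combined as follows. This assumes that "improved by 40%" means the total GC time shrank by 40% and that non-GC time is unchanged; neither assumption is stated in the slides, and the function name `project_run` is hypothetical.

```python
def project_run(total_seconds, gc_fraction, gc_reduction):
    """Estimate GC time before and after, and the new total run time."""
    gc_before = total_seconds * gc_fraction
    gc_after = gc_before * (1.0 - gc_reduction)
    new_total = total_seconds - gc_before + gc_after
    return gc_before, gc_after, new_total


# About 80 s in total, 30% of it in GC, GC time assumed to drop by 40%.
gc_before, gc_after, new_total = project_run(80.0, 0.30, 0.40)
```

Under these assumptions GC time drops from about 24 seconds to about 14.4 seconds, and the whole `make rdoc` run from about 80 seconds to about 70.4 seconds.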
  160. So fast!!
  161. In a many-core environment ✓ I expect we would get a large improvement. ✓ e.g. 8 cores, 16 cores... ✓ But my machine has just 2 cores. ✓ I can't see it :(
  162. Best case for Parallel GC ✓ If there are many objects. ✓ In this case, there are also many mark targets. ✓ If the objects are long-lived. ✓ Server-side application?
  163. Demo
  164. Demonstration ✓ I want to show the performance improvement with Parallel GC. ✓ This demonstration is video game style.
  165. Let me explain about this game.
  166. And, Character has HP.
  167. When GC runs,
  168. the character loses HP while waiting for the GC to finish.
  169. We must reach the goal before HP runs out.
  170. Other characteristics of SUPER NARIO GC ✓ GC runs at fixed intervals. ✓ A lot of objects are generated to increase GC's burden. ✓ Burden = Game Level
  171. Try to compare Original GC and Parallel GC ✓ Original GC pause time is long. ✓ This game will be difficult. ✓ Parallel GC pause time is short. ✓ This game will be easy.
  172. OK, Let's try!
  173. DEMO Original GC version
  174. Oops.. so difficult!!!
  175. DEMO Parallel GC version
  176. Wow!! Easy!!!!
  177. Let's compare average GC times
  178. Fast!!
  179. Remaining Problems
  180. Windows OS is not supported ✓ The Mark Worker uses pthread as its native thread. ✓ And it uses some gcc built-in functions. ✓ But I'll support Windows eventually.
  181. Increased memory usage. ✓ Size of 1 Deque is roughly 32KB. ✓ But generally multi-core machines have plenty of memory. ✓ So, I think it's OK :P
  182. Conclusion
  183. Conclusion ✓ I implemented Parallel Marking GC ✓ GC was improved! ✓ I'll report to ruby-core soon.
  184. Conclusion ✓ But, Parallel Marking has some problems. ✓ I'll fix these.
  185. source code ✓ Parallel Marking GC ✓ <URL:https://github.com/authorNari/ruby/tree/pmark_div_root2> ✓ SUPER NARIO GC ✓ <URL:https://github.com/authorNari/nario/>
  186. Acknowledgments ✓ The following people helped me make this presentation!! ✓ Tor-san!! ✓ matz, shugo, yhara, sada, takaokouji, other co-workers!!
  187. Thank you!!!
  188. Do you have any questions? Please keep your questions short and simple :)
  189. Sorry ✓ It's too difficult for me to understand/answer the question. ✓ Could you send the question on Twitter (@nari_en)?
