Parallel worlds of CRuby's GC

I gave this presentation at RubyConf 2011. Yay!

Transcript

  • 1. Parallel worlds of CRuby's GC. nari / Narihiro Nakamura / @nari_en, Network Applied Communication Laboratory Ltd. (Powered by Rabbit 0.9.3)
  • 2. I'm very happy now.
  • 3. Today is my first presentation in English.
  • 4. My English is not good.
  • 5. But I'll do my best. Please bear with me :)
  • 6. Self introduction
  • 7. Ice-cream factory ✓ I worked on an assembly line. ✓ For example, I made many cardboard boxes. ✓ I was a professional cardboard box maker :)
  • 8. Ice-cream factory ✓ I made 150 boxes per hour (ZOMG)
  • 9. I was like a machine!! http://www.flickr.com/photos/kevincollins123/5887984753/
  • 10. Working with Java ✓ I worked in a big company. ✓ This work was similar to assembly line work. ✓ I made one part of a product. I didn't understand the whole product.
  • 11. I was still like a machine!! http://www.flickr.com/photos/kevincollins123/5887984753/
  • 12. My current work ✓ Currently, I work at NaCl. ✓ matz, shyouhei, and takaokouji are my co-workers. ✓ shugo is my boss. ✓ They are CRuby committers.
  • 13. When I started Ruby programming ✓ I felt free. ✓ This work wasn't similar to assembly line work. ✓ I could make the whole product.
  • 14. I was no longer a machine!! http://www.flickr.com/photos/danzden/121379782/
  • 15. Garbage Collection for me ✓ GC technology is very interesting to me. ✓ GC is a garbage collecting machine. ✓ I've been creating it since then. It's very fun!!
  • 16. I'm making a machine!!
  • 17. My relationship to GC
  • 18. I'm a CRuby committer ✓ I work on GC.
  • 19. And I wrote a book about GC.
  • 20. But it's only in Japanese :(
  • 21. And I've been creating GC with RDD.
  • 22. What is RDD?
  • 23. RDD = RubyKaigi Driven Development
  • 24. My RDD history ✓ LazySweepGC - RubyKaigi2008 ✓ LonglifeGC - 2009 ✓ LazySweepGC - 2010 ✓ ParallelMarkingGC - 2011
  • 25. My RDD history ✓ LazySweepGC - RubyKaigi2008 ✓ LonglifeGC - 2009 ✓ LazySweepGC - 2010 ✓ ParallelMarkingGC - 2011
  • 26. LonglifeGC ✓ It treats long-lived objects as a special case. ✓ It is similar to generational GC. ✓ LonglifeGC was rejected from CRuby 1.9.2 for some reason. ✓ :(
  • 27. But, LonglifeGC has been used in Kiji :-) http://www.flickr.com/photos/conifer/2389654222/
  • 28. Kiji ✓ Kiji is an optimized version of REE by Twitter developers. ✓ The Twitter team substantially extended LonglifeGC. ✓ It's cool!!
  • 29. But Kiji will also be rejected... :(
  • 30. My RDD history ✓ LazySweepGC - RubyKaigi2008 ✓ LonglifeGC - 2009 ✓ LazySweepGC - 2010 ✓ ParallelMarkingGC - 2011
  • 31. LazySweepGC ✓ Traditional M&S GC executes mark and sweep atomically. ✓ The Ruby application stops during GC (stop-the-world). ✓ In lazy sweeping, the sweep phase is performed lazily, a little at a time.
  • 32. LazySweepGC ✓ Each object allocation sweeps Ruby's heap ✓ until it finds an appropriate free object.
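The lazy sweeping idea above can be sketched in a few lines of Ruby. This is only a toy model under assumptions of my own: the names `Slot`, `LazyHeap`, and `allocate` are hypothetical and do not come from CRuby's source, and the heap is just an array of slots left over from a finished mark phase.

```ruby
# Toy heap slot: a mark bit plus an in-use flag. Hypothetical names throughout.
Slot = Struct.new(:marked, :in_use)

class LazyHeap
  attr_reader :slots_swept

  def initialize(slots)
    @slots = slots
    @cursor = 0        # where the next lazy sweep resumes
    @slots_swept = 0
  end

  # Sweep only until one reusable slot is found, then stop.
  def allocate
    while @cursor < @slots.size
      slot = @slots[@cursor]
      @cursor += 1
      @slots_swept += 1
      if slot.marked
        slot.marked = false   # survivor: clear the mark for the next cycle
      else
        slot.in_use = true    # unmarked slot is dead or free: reuse it
        return slot
      end
    end
    nil  # heap exhausted; a real collector would start a new mark phase here
  end
end

# State right after a mark phase: slots 0 and 1 are live, slot 2 is garbage.
heap = LazyHeap.new([Slot.new(true, true), Slot.new(true, true),
                     Slot.new(false, true), Slot.new(false, false)])
obj = heap.allocate
puts heap.slots_swept  # 3 slots were examined; the 4th waits for a later allocation
```

The point of the sketch is that the cost of sweeping is spread over many allocations instead of being paid in one long pause.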
  • 33. Improvements ✓ This improves the response time of GC. ✓ I.e. the worst-case pause time of GC decreases.
  • 34. LazySweepGC ✓ LazySweepGC is available since Ruby 1.9.3.
  • 35. My RDD history ✓ LazySweepGC - RubyKaigi2008 ✓ LonglifeGC - 2009 ✓ LazySweepGC - 2010 ✓ ParallelMarkingGC - 2011
  • 36. Today's topics
  • 37. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  • 38. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  • 39. Why do we need Parallel Marking?
  • 40. This is CRuby's current GC.
  • 41. CRuby's current GC ✓ GC operates on only 1 core. ✓ In a multi-core environment, the other cores don't help GC.
  • 42. GC: "I'm alone, it's so hard." http://www.flickr.com/photos/hortont/2698261070/
  • 43. We should run GC in parallel!! http://www.flickr.com/photos/knallaerbse/2863161933/
  • 44. First, let me explain a few GC-related concepts.
  • 45. What is GC? ✓ GC collects all dead objects.
  • 46. What is a dead object? ✓ A dead object is an object that will never be referenced by the program again. ✓ In GC terms, we say that a dead object is unreachable from Roots.
  • 47. What are Roots? ✓ Roots are a set of pointers that directly reference objects in the program. ✓ e.g. Ruby's local variables, etc.
  • 48. For example
  • 49. Please remember that ✓ GC collects objects that are unreachable from Roots.
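Since the example diagram is not part of this transcript, here is a small Ruby sketch of what "unreachable from Roots" means. The object graph and the helper `reachable_from` are hypothetical, made up for illustration.

```ruby
require "set"

# Hypothetical object graph: each key references the objects in its array.
graph = {
  a: [:b, :c],
  b: [],
  c: [:d],
  d: [],
  e: [:f],   # e and f reference each other, but nothing reaches them
  f: [:e]
}

# Follow references starting from the roots and collect everything visited.
def reachable_from(roots, graph)
  seen = Set.new
  stack = roots.dup
  until stack.empty?
    obj = stack.pop
    next if seen.include?(obj)
    seen << obj
    stack.concat(graph.fetch(obj))
  end
  seen
end

live = reachable_from([:a], graph)
dead = graph.keys - live.to_a
p dead  # the objects a collector would be free to reclaim
```

Note that `e` and `f` still reference each other, yet both are dead because no path from the roots leads to them.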
  • 50. Next, let me explain the current CRuby GC algorithm.
  • 51. CRuby's GC algorithm summary ✓ CRuby adopts the Mark & Sweep algorithm. ✓ The collector works in separate Mark and Sweep phases.
  • 52. In the Mark phase ✓ The collector marks live objects that are reachable from Roots.
  • 53. For example
  • 54. Mark phase with GC.start
  • 55. Ruby Heap after marking
  • 56. In the Sweep phase ✓ The collector sweeps "dead" objects. ✓ "dead" means unmarked ✓ "dead" means unreachable from Roots
  • 57. Sweep phase
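The two phases can be illustrated with a minimal Ruby sketch. It assumes a hypothetical `Obj` struct with an explicit mark bit; CRuby's real collector is written in C and works on its own heap layout, so treat this purely as a model of the algorithm.

```ruby
# Hypothetical heap object: a name, outgoing references, and a mark bit.
Obj = Struct.new(:name, :refs, :marked)

# Mark phase: flag everything reachable from the given object.
def mark(obj)
  return if obj.marked
  obj.marked = true
  obj.refs.each { |child| mark(child) }
end

# Sweep phase: drop unmarked objects, clear marks on the survivors.
def sweep(heap)
  live, _dead = heap.partition(&:marked)
  live.each { |obj| obj.marked = false }
  live
end

d = Obj.new("D", [], false)
c = Obj.new("C", [d], false)
b = Obj.new("B", [], false)
a = Obj.new("A", [b, c], false)
x = Obj.new("X", [], false)      # not referenced from any root
heap = [a, b, c, d, x]
roots = [a]

roots.each { |root| mark(root) }  # Mark phase
heap = sweep(heap)                # Sweep phase
puts heap.map(&:name).join(" ")   # A B C D
```

Both phases run to completion while the program is paused, which is the stop-the-world behavior the following slides describe.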
  • 58. Characteristics of CRuby's GC
  • 59. Characteristics ✓ The stop-the-world algorithm ✓ Single thread execution
  • 60. Recently, PCs have multi-core processors. But ✓ GC executes on a single thread. ✓ The other cores don't work during GC. ✓ What a waste!!
  • 61. How can we fix this?
  • 62. Use Parallel Marking, Luke
  • 63. What is Parallel Marking?
  • 64. What is Parallel Marking? ✓ The collector runs several marking processes in parallel ✓ by using native threads. ✓ We will be happy on multi-core machines.
  • 65. Flow diagram for Parallel Marking
  • 66. BTW: Why not perform sweeping in parallel?
  • 67. Why not perform sweeping in parallel? ✓ Sweeping is much faster than marking. ✓ See ko1's research: ✓ <URL:http://www.atdot.net/~ko1/diary/201011.html#d4>
  • 68. Why not perform sweeping in parallel? ✓ So, Mark phase improvement = GC improvement. ✓ And we already have lazy sweeping.
  • 69. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  • 70. What to consider when implementing Parallel Marking?
  • 71. We should consider two problems ✓ Workload balancing ✓ Wait-free algorithm
  • 72. Workload balancing
  • 73. How can we divide the marking task into sub-tasks?
  • 74. I tried to think of a simple approach.
  • 75. 1 branch of Roots is marked by 1 thread.
  • 76. This means... ✓ Tasks are distributed to multiple threads. ✓ The task of marking the entire heap is divided into several tasks, each marking a single branch.
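A rough sketch of this "one branch per thread" idea, with Ruby threads standing in for the native marking threads. The branch contents are hypothetical, and Ruby threads in CRuby do not actually run Ruby code in parallel, so this only shows how the work is divided.

```ruby
# Hypothetical branches hanging off Roots, given as lists of object names.
branches = {
  "A" => %w[a1 a2],
  "B" => %w[b1],
  "C" => %w[c1 c2 c3 c4 c5 c6 c7 c8]
}

threads = branches.map do |name, objects|
  Thread.new do
    # Each thread marks only its own branch and then simply exits.
    [name, objects.count]
  end
end

work = threads.map(&:value).to_h
p work  # how many objects each thread had to mark
```

With branches of such different sizes, the threads for A and B finish almost immediately while the thread for C does most of the marking alone.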
  • 77. This seems to be no problem.
  • 78. But actually, this solution suffers from the workload balancing problem.
  • 79. Each thread doesn't know what the other threads are doing.
  • 80. For instance, if A and B finish work early,
  • 81. then, they will stop doing anything :(
  • 82. I think "machines should work forever" :D
  • 83. So, I think A and B should ...
  • 84. http://www.flickr.com/photos/ryanr/157458385/
  • 85. Parallel Marking with Task Stealing.
  • 86. If A and B finish work early,
  • 87. This is called "Task Stealing"
  • 88. We should consider two problems ✓ Workload balancing ✓ Wait-free algorithm
  • 89. Wait-free algorithm
  • 90. What does "wait-free" mean? ✓ A wait-free program executes without blocking. ✓ It guarantees per-thread progress.
  • 91. Why is wait-free important?
  • 92. Amdahl's law
  • 93. Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. [cited from "Amdahl's law", Wikipedia]
  • 94. Amdahl's law is used in parallel computing ✓ If the parallel portion of the system is X% ✓ and the number of processors is Y, ✓ how much speedup can we expect?
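The chart that answered this question is not in the transcript, so here is the standard Amdahl's law formula as a small Ruby helper: speedup = 1 / ((1 - P) + P / N), where P is the parallel fraction and N the number of processors. The function name and the 90% figure are my own illustrative choices, not numbers from the talk.

```ruby
# parallel_fraction: share of the work that can run in parallel (0.0..1.0)
# processors: number of processors working on the parallel part
def amdahl_speedup(parallel_fraction, processors)
  1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)
end

[2, 4, 8, 16].each do |n|
  printf("%2d processors, 90%% parallel: %.2fx\n", n, amdahl_speedup(0.9, n))
end
```

Even with an unlimited number of processors, a program that is 90% parallel can never exceed a 10x speedup, because the remaining 10% still runs serially.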
  • 95. It's worse than expected, right?
  • 96. The conclusion so far
  • 97. The conclusion so far ✓ We should consider how to balance workloads efficiently. ✓ So, we use Task Stealing. ✓ We should eliminate non-parallel parts ✓ by using a wait-free algorithm.
  • 98. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  • 99. How to implement Parallel Marking?
  • 100. Task Stealing ✓ In Task Stealing, threads steal tasks from each other. ✓ Task Stealing is achieved with Arora's Deque.
  • 101. Arora's Deque ✓ Deque stands for Double-Ended Queue. ✓ In Arora's Deque, the deque contains tasks as elements. ✓ It's a wait-free data structure.
  • 102. Arora's Deque has only three operations.
  • 103. Each mark worker has a single deque.
  • 104. Only the owner can call pop() and push().
  • 105. A worker can call shift() to steal from another worker's deque.
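The three operations can be pictured with a plain Ruby class. This is a hypothetical stand-in (`TaskDeque` is my own name) that only shows which end each operation works on; unlike Arora's Deque it is neither wait-free nor thread-safe.

```ruby
# Hypothetical stand-in for Arora's Deque. It only shows which end each
# operation works on; it is NOT wait-free or even thread-safe.
class TaskDeque
  def initialize
    @tasks = []
  end

  # Owner only: add a task at the bottom.
  def push(task)
    @tasks.push(task)
    self
  end

  # Owner only: take the most recently pushed task from the bottom.
  def pop
    @tasks.pop
  end

  # Thieves: take the oldest task from the top.
  def shift
    @tasks.shift
  end

  def empty?
    @tasks.empty?
  end
end

deque = TaskDeque.new
deque.push(:t1).push(:t2).push(:t3)
p deque.pop    # => :t3 (owner side)
p deque.shift  # => :t1 (thief side)
```

Because the owner and the thieves work on opposite ends, they only collide when the deque is nearly empty.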
  • 106. "Hey, wait a minute, doesn't shift() have contention problems?"
  • 107. In what ways could shift() cause contention problems? e.g. ✓ Multiple threads (workers) may call shift() on the same deque at the same time.
  • 108. In what ways could shift() cause contention problems? e.g. ✓ shift() and pop() could be called at the same time ✓ when the deque has only one element.
  • 109. But Arora's Deque avoids these contention problems.
  • 110. Serialization ✓ shift() is serialized by using CAS. ✓ CAS = Compare And Swap ✓ And this serialization doesn't use a lock. ✓ It's wait-free!!
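To give a feel for how CAS serializes two thieves, here is a deterministic Ruby sketch. `AtomicCell` is a hypothetical class of my own; it emulates compare-and-swap with a Mutex purely to show the semantics, whereas the real implementation relies on a lock-free hardware CAS. The interleaving of the two thieves is written out by hand rather than produced by real threads.

```ruby
# Hypothetical cell emulating compare-and-swap. Real CAS is a single atomic
# CPU instruction; the Mutex here only stands in for that atomicity.
class AtomicCell
  def initialize(value)
    @value = value
    @lock = Mutex.new
  end

  def get
    @lock.synchronize { @value }
  end

  # Store new_value only if the cell still holds expected.
  def compare_and_swap(expected, new_value)
    @lock.synchronize do
      return false unless @value == expected
      @value = new_value
      true
    end
  end
end

tasks = [:b, :c]
top = AtomicCell.new(0)      # index of the next task a thief may take

# Two thieves read the same index before either one claims it.
seen_by_thief1 = top.get
seen_by_thief2 = top.get

thief1_won = top.compare_and_swap(seen_by_thief1, seen_by_thief1 + 1)
thief2_won = top.compare_and_swap(seen_by_thief2, seen_by_thief2 + 1)

puts "thief1 takes #{tasks[seen_by_thief1]}" if thief1_won
puts "thief2 must give up or retry" unless thief2_won
```

Only one of the two competing updates can succeed, so a task is never handed out twice, and the loser finds out immediately instead of waiting on a lock.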
  • 111. I omit the details of the implementation of the serialization.
  • 112. For the sake of this presentation, let's assume that Arora's Deque avoids contention problems.
  • 113. Summary for Arora's Deque ✓ A simple data structure for Task Stealing. ✓ Each worker has a single deque. ✓ Stealing (shift operation) is wait-free!
  • 114. How to use Arora's Deque in Parallel Marking?
  • 115. First try: A task is an object.
  • 116. Let's say that worker A has a branch that is composed of 4 objects.
  • 117. We start by marking A and pushing it to the deque.
  • 118. pop A, mark B and C, push B and C.
  • 119. pop C, mark D, push D
  • 120. pop D, pop B
  • 121. This is how a branch is marked.
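The steps on the last few slides can be replayed with an Array used as the marking stack. The graph shape (A references B and C, C references D) is inferred from the steps above, since the diagrams are not in the transcript.

```ruby
# Hypothetical 4-object branch: A references B and C, C references D.
children = { "A" => %w[B C], "B" => [], "C" => ["D"], "D" => [] }

marked = []
deque = []   # used as a plain marking stack by its owner
trace = []

marked << "A"
deque.push("A")
trace << "mark A, push A"

until deque.empty?
  obj = deque.pop
  step = "pop #{obj}"
  children[obj].each do |child|
    next if marked.include?(child)
    marked << child
    deque.push(child)
    step += ", mark #{child}, push #{child}"
  end
  trace << step
end

puts trace
```

Every object costs at least one push and one pop, which matters for the performance discussion that follows.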
  • 122. How do you steal?
  • 123. Suppose that Worker1 has tasks B and C. Worker2 has no tasks.
  • 124. Worker2 steals task B from Worker1 by using shift().
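The steal itself looks like this when plain Arrays stand in for the two deques. There is no real concurrency here, and whether the thief queues the stolen task or marks it right away is an assumption of this sketch.

```ruby
# Plain Arrays stand in for each worker's deque (no real concurrency here).
worker1_deque = ["B", "C"]   # B was pushed first, so it sits at the top
worker2_deque = []

# Worker2 has nothing to do, so it takes the oldest task from Worker1.
stolen = worker1_deque.shift if worker2_deque.empty?
worker2_deque.push(stolen) unless stolen.nil?

p worker1_deque  # => ["C"]
p worker2_deque  # => ["B"]
```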
  • 125. Summary ✓ The marker uses Arora's Deque as a marking stack. ✓ A "task" means an object. ✓ The granularity of the task is very fine. ✓ This is a naive implementation.
  • 126. I implemented this approach.
  • 127. But..
  • 128. It's slower than the original GC.
  • 129. OMG... http://www.flickr.com/photos/emariephotos/4958245676/
  • 130. I fell into the Pitfalls of Parallel Processing (PPP!!!)
  • 131. Why slow?
  • 132. Why slow? ✓ pop(), push(), and shift() are called frequently ✓ because the deque holds fine-grained tasks. ✓ Their overhead is too big.
  • 133. How to fix this?
  • 134. We can make the tasks less fine-grained.
  • 135. A task is a branch
  • 136. All branches in Roots are divided roughly among the deques.
  • 137. Each Worker marks a branch in its deque.
  • 138. When the deque is empty, the worker steals a branch from another worker.
  • 139. Like this!!
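A sketch of branch-level tasks with stealing, using Ruby threads and Arrays as deques. Everything here is an assumption for illustration: the branch contents, the round-robin division, the helper `next_branch`, and especially the single Mutex, which stands in for the wait-free deque operations of the real implementation.

```ruby
# Hypothetical branches under Roots, each a list of object names.
branches = [%w[a1 a2], %w[b1], %w[c1 c2 c3], %w[d1 d2 d3 d4], %w[e1]]
worker_count = 2

# Divide the branches roughly (round-robin) among one deque per worker.
deques = Array.new(worker_count) { [] }
branches.each_with_index { |branch, i| deques[i % worker_count].push(branch) }

# Own deque first (pop); if it is empty, steal from another worker (shift).
def next_branch(deques, id)
  own = deques[id].pop
  return own if own
  victim = deques.each_with_index.find { |deque, i| i != id && !deque.empty? }
  victim && victim.first.shift
end

lock = Mutex.new                         # stand-in for wait-free deque operations
marked = Array.new(worker_count) { [] }  # what each worker ended up marking

workers = worker_count.times.map do |id|
  Thread.new do
    loop do
      branch = lock.synchronize { next_branch(deques, id) }
      break if branch.nil?
      marked[id].concat(branch)          # "mark" every object in the branch
    end
  end
end
workers.each(&:join)

marked.each_with_index { |objs, id| puts "worker #{id}: #{objs.join(' ')}" }
```

Which worker ends up marking which branch depends on thread scheduling, but every branch is marked exactly once and the deque operations happen once per branch rather than once per object.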
  • 140. Good point & bad point ✓ The number of calls to the deque's operations was reduced. ✓ The marking speed of each worker is improved. ✓ However, coarse-grained tasks decrease parallelism.
  • 141. Why do coarse-grained tasks decrease parallelism?
  • 142. Tasks may involve a large branch.
  • 143. If an object in B's branch has many child objects...
  • 144. ... then A can't steal it while B is marking the large branch.
  • 145. So, the worker needs to treat large branches as special cases.
  • 146. Almost all large branches hold large Array objects and/or large Hash objects.
  • 147. Treatment for large Array and Hash objects ✓ Each marker has a special deque to manage them. ✓ A marker divides them into fixed-size tasks. ✓ e.g. elements 0-9 of an Array, elements 10-19 of an Array...
  • 148. Treatment for large Array and Hash ✓ By doing this, other workers can steal the divided tasks. ✓ This improves parallelism.
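Dividing a large Array into fixed-size tasks might look like the following. The chunk size of 10 simply mirrors the "0-9, 10-19" example on the slide, and `chunk_tasks` is a hypothetical helper; the talk does not say what size the real implementation uses.

```ruby
CHUNK_SIZE = 10   # assumed task size, matching the "0-9, 10-19" example

# Split a large array into index ranges that can be queued as separate tasks.
def chunk_tasks(array, chunk_size = CHUNK_SIZE)
  array.each_index.each_slice(chunk_size).map do |indexes|
    indexes.first..indexes.last
  end
end

large_array = Array.new(25) { |i| "obj#{i}" }
special_deque = chunk_tasks(large_array)
p special_deque  # => [0..9, 10..19, 20..24]
```

Each range is an independent task, so an idle worker can steal, say, `10..19` while the owner is still marking `20..24`.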
  • 149. Summary ✓ The naive implementation was slow. ✓ The grain of its tasks was too fine. ✓ Now a "task" means a branch in Roots. ✓ The grain of the task is coarse. ✓ It's faster!!
  • 150. Today's topics ✓ Why do we need Parallel Marking? ✓ What to consider? ✓ How to implement? ✓ How much did performance improve?
  • 151. How much did performance improve?
  • 152. These are my machine specs ✓ My machine has only 2 cores ✓ Memory: 8GB ✓ OS: Linux
  • 153. Parallel marking uses 4 marking threads.
  • 154. The first benchmark program is ✓ make benchmark ✓ This is the benchmark used in CRuby development.
  • 155. Why does this seem so slow? ✓ I think it's affected by Parallel Marking's preparation, ✓ e.g. creating marking threads and allocating deques.
  • 156. Why does this seem so slow? ✓ In most of the benchmarks, there are few objects to mark. ✓ In this case, the overhead of Parallel Marking is relatively expensive.
  • 157. The next benchmark program is ✓ make rdoc ✓ make rdoc generates the Ruby documentation. ✓ This benchmark measures the total execution time and the GC execution time of make rdoc.
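The talk does not say how the GC time was measured. One way to get both numbers for a Ruby workload is `GC::Profiler` together with `Benchmark.realtime`, as in this sketch; the string-allocating workload is a made-up placeholder, not `make rdoc`.

```ruby
require "benchmark"

GC::Profiler.enable
elapsed = Benchmark.realtime do
  # Hypothetical workload that allocates many short-lived objects.
  200_000.times.map { |i| "object #{i}" }
  GC.start
end
gc_time = GC::Profiler.total_time   # seconds spent in GC while profiling was enabled
GC::Profiler.disable

printf("total: %.3fs, GC: %.3fs (%.1f%%)\n",
       elapsed, gc_time, 100.0 * gc_time / elapsed)
```

The printed percentage is the share of wall-clock time spent in GC, which is the kind of figure quoted on the next slide.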
  • 158. make rdoc ✓ It takes about 80 seconds on my machine. ✓ In fact, 30% of that time is spent on GC!! ✓ How much did performance improve?
  • 159. Total GC time is improved by 40%!
  • 160. So fast!!
  • 161. In a many-core environment ✓ I expect we'd get a large improvement, ✓ e.g. 8 cores, 16 cores... ✓ But my machine has just 2 cores. ✓ I can't see it :(
  • 162. Best case for Parallel GC ✓ When there are many objects. ✓ In this case, there are also many mark targets. ✓ When the objects are long-lived. ✓ Server-side applications?
  • 163. Demo
  • 164. Demonstration ✓ I want to show the performance improvement with Parallel GC. ✓ This demonstration is video game style.
  • 165. Let me explain this game.
  • 166. And the character has HP.
  • 167. When GC runs,
  • 168. the character loses HP while waiting for the GC to finish.
  • 169. We must reach the goal before HP runs out.
  • 170. Other characteristics of SUPER NARIO GC ✓ GC runs at fixed intervals. ✓ A lot of objects are generated to increase GC's burden. ✓ Burden = Game Level
  • 171. Let's compare Original GC and Parallel GC ✓ Original GC pause time is long. ✓ This game will be difficult. ✓ Parallel GC pause time is short. ✓ This game will be easy.
  • 172. OK, let's try!
  • 173. DEMO: Original GC version
  • 174. Oops.. so difficult!!!
  • 175. DEMO: Parallel GC version
  • 176. Wow!! Easy!!!!
  • 177. Let's compare average GC times
  • 178. Fast!!
  • 179. Remaining Problems
  • 180. Windows is not supported ✓ The mark worker uses pthread for native threads ✓ and uses some gcc built-in functions. ✓ But I'll support Windows eventually.
  • 181. Increased memory usage ✓ The size of 1 deque is roughly 32KB. ✓ But multi-core machines generally have plenty of memory. ✓ So, I think it's OK :P
  • 182. Conclusion
  • 183. Conclusion ✓ I implemented Parallel Marking GC. ✓ GC was improved! ✓ I'll report to ruby-core soon.
  • 184. Conclusion ✓ But Parallel Marking has some problems. ✓ I'll fix these.
  • 185. Source code ✓ Parallel Marking GC ✓ <URL:https://github.com/authorNari/ruby/tree/pmark_div_root2> ✓ SUPER NARIO GC ✓ <URL:https://github.com/authorNari/nario/>
  • 186. Acknowledgments ✓ The following people helped me make this presentation!! ✓ Tor-san!! ✓ matz, shugo, yhara, sada, takaokouji, other co-workers!!
  • 187. Thank you!!!
  • 188. Do you have any questions? Please keep questions short and simple :)
  • 189. Sorry ✓ It's too difficult for me to understand/answer the question. ✓ Could you send the question on Twitter (@nari_en)?