9. Ice-cream factory
✓ I worked on an assembly line
✓ For example, I made many
cardboard boxes.
✓ I was a professional cardboard box
maker :)
Parallel worlds of CRuby's GC Powered by Rabbit 0.9.3
10. Ice-cream factory
✓ I made 150 boxes per hour
(ZOMG)
11. I was like a machine!!
http://www.flickr.com/photos/kevincollins123/5887984753/
14. Working with Java
✓ I worked in a big company.
✓ This work was similar to
assembly line work.
✓ I made a part of a product. I didn't
understand the whole product.
15. I was still like a
machine!!
http://www.flickr.com/photos/kevincollins123/5887984753/
18. My current work
✓ Currently, I work at NaCl.
✓ matz and shyouhei and takaokouji
are my co-workers.
✓ shugo is my boss.
✓ They are CRuby committers.
19. When I started Ruby
programming
✓ I felt free.
✓ This work wasn't similar to
assembly line work.
✓ I could make the whole product.
20. I was no longer
a machine!!
http://www.flickr.com/photos/danzden/121379782/
22. Garbage Collection for me
✓ GC technology is very interesting
to me.
✓ GC is a garbage collecting
machine.
✓ I've been creating it since then.
It's very fun!!
31. My RDD history
✓ LazySweepGC - RubyKaigi2008
✓ LonglifeGC - 2009
✓ LazySweepGC - 2010
✓ ParallelMarkingGC - 2011
33. LonglifeGC
✓ It treats long-life objects as a
special case.
✓ similar to Generational GC.
✓ LonglifeGC was rejected in
CRuby 1.9.2 for some reason.
✓ :'(
34. But, LonglifeGC has been
used in Kiji :-)
http://www.flickr.com/photos/conifer/2389654222/
35. Kiji
✓ Kiji is an optimized version of
REE by Twitter developers.
✓ The Twitter team substantially
extended LonglifeGC.
✓ It's cool!!
37. My RDD history
✓ LazySweepGC - RubyKaigi2008
✓ LonglifeGC - 2009
✓ LazySweepGC - 2010
✓ ParallelMarkingGC - 2011
38. LazySweepGC
✓ Traditional M&S GC executes
mark and sweep atomically.
✓ The Ruby application stops during GC
(stop-the-world).
✓ With lazy sweeping, sweeping is
deferred and done little by little.
39. LazySweepGC
✓ Each object allocation sweeps
Ruby's heap
✓ only until it finds a suitable free slot.
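A minimal sketch of the lazy sweeping idea, assuming a toy heap made of slots with an "in use" flag and a mark bit. `Slot`, `LazySweepHeap` and `slots_swept` are hypothetical names for illustration only; this is not CRuby's actual implementation.

```ruby
# Toy heap slot: is it occupied, and did the last mark phase reach it?
Slot = Struct.new(:in_use, :marked)

class LazySweepHeap
  attr_reader :slots_swept

  def initialize(slots)
    @slots = slots
    @cursor = 0        # where the next sweep step resumes
    @slots_swept = 0   # total number of slots examined so far
  end

  # Sweep forward only until one free slot is found, then stop.
  # Returns the index of the slot, or nil when the heap is exhausted.
  def allocate
    while @cursor < @slots.size
      index = @cursor
      slot = @slots[index]
      @cursor += 1
      @slots_swept += 1
      if slot.marked
        slot.marked = false   # live object: clear the mark for the next GC
      else
        slot.in_use = true    # dead or unused slot: reuse it right away
        return index
      end
    end
    nil
  end
end

# Heap state right after a mark phase: slots 0, 1 and 3 hold live objects.
heap = LazySweepHeap.new([
  Slot.new(true, true), Slot.new(true, true), Slot.new(true, false),
  Slot.new(true, true), Slot.new(true, false), Slot.new(false, false)
])

first = heap.allocate    # examines slots 0..2 and reuses slot 2
second = heap.allocate   # examines slots 3..4 and reuses slot 4
```

Each allocation pays for only a small part of the sweep, which is why the pause caused by a single GC gets shorter.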
40. Improvements
✓ This improves the response time
of GC
✓ i.e. the worst-case pause time of GC
decreases.
41. LazySweepGC
✓ LazySweepGC has been available
since Ruby 1.9.3
42. My RDD history
✓ LazySweepGC - RubyKaigi2008
✓ LonglifeGC - 2009
✓ LazySweepGC - 2010
✓ ParallelMarkingGC - 2011
44. Today's topics
✓ Why do we need Parallel
Marking?
✓ What to consider?
✓ How to implement?
✓ How much did performance
improve?
48. Current CRuby's GC
✓ GC operates on only 1 core.
✓ In a multi-core environment, the
other cores don't help GC.
49. GC:"I'm alone,
it's so hard."
http://www.flickr.com/photos/hortont/2698261070/
50. We should run GC in
parallel!!
http://www.flickr.com/photos/knallaerbse/2863161933/
52. What is GC?
✓ GC collects all dead objects.
53. What is a dead object?
✓ A dead object is an object that will
never be referenced again by the program.
✓ In GC terms, we say that a dead
object is unreachable from Roots.
54. What are Roots?
✓ Roots are the set of pointers that
directly reference objects in the
program.
✓ e.g. Ruby's local variables, etc.
55. For example
56. Please remember that
✓ GC collects objects that are
unreachable from Roots.
57. Next, let me explain the
current CRuby GC
algorithm.
58. CRuby's GC algorithm
summary
✓ CRuby adopts the Mark & Sweep
algorithm
✓ The collector works in separate Mark
and Sweep phases.
59. In the Mark phase
✓ The collector marks live objects that
are reachable from Roots.
60. For example
61. Mark phase with GC.start
62. Ruby Heap after marking
63. In the Sweep phase
✓ The collector sweeps "dead" objects
✓ "dead" means unmarked
✓ "dead" means unreachable from Roots
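The two phases can be sketched on a toy heap as follows. This is only an illustration under assumptions: the heap is modeled as a Hash from object ids to the ids they reference, and `TOY_HEAP`, `TOY_ROOTS`, `mark` and `sweep` are hypothetical names, not CRuby internals.

```ruby
require 'set'

# Toy heap: object id => ids of the objects it references.
TOY_HEAP = {
  a: [:b, :c],
  b: [:d],
  c: [],
  d: [:b],   # cycle b -> d -> b, still reachable through :a
  e: [:f],   # :e references :f, but no root reaches either of them
  f: []
}.freeze
TOY_ROOTS = [:a].freeze

# Mark phase: collect every object reachable from the roots.
def mark(heap, roots)
  marked = Set.new
  stack = roots.dup
  until stack.empty?
    obj = stack.pop
    next unless marked.add?(obj)   # add? returns nil if already marked
    stack.concat(heap.fetch(obj))
  end
  marked
end

# Sweep phase: every unmarked object is dead.
def sweep(heap, marked)
  heap.keys.reject { |obj| marked.include?(obj) }
end

live = mark(TOY_HEAP, TOY_ROOTS)
dead = sweep(TOY_HEAP, live)   # => [:e, :f]
```

Note that `:e` still references `:f`, yet both are swept: what matters is reachability from Roots, not whether something points at the object.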
64. Sweep phase
66. Characteristics
✓ The stop-the-world algorithm
✓ Single thread execution
67. Recent PCs have multi-core
processors. But,
✓ GC executes on a single thread.
✓ Other cores don't work during GC.
✓ What a waste!!
71. What is Parallel Marking?
✓ The collector runs several marking
processes in parallel
✓ by using native threads.
✓ We will be happy on a multi-core
machine.
72. Flow diagram for Parallel
Marking
74. Why not perform sweeping in
parallel?
✓ Sweeping is much faster than
marking.
✓ You can see ko1's research
✓ <URL:http://www.atdot.net/~ko1/diary/201011.html#d4>
75. Why not perform sweeping in
parallel?
✓ So, Mark phase improvement =
GC improvement
✓ And, we already have the lazy
sweeping.
76. Today's topics
✓ Why do we need Parallel
Marking?
✓ What to consider?
✓ How to implement?
✓ How much did performance
improve?
85. This means..
✓ Tasks are distributed to multiple
threads.
✓ The task of marking the entire
heap is divided into several tasks,
each marking a single branch.
100. What does "wait-free" mean?
✓ A wait-free program executes without
blocking.
✓ It guarantees per-thread progress.
103. Amdahl's law
is used to find the
maximum expected
improvement to an
overall system when
only part of the system
is improved.
[cited from `Amdahl's law - Wikipedia']
104. Amdahl's law is used in
parallel computing
✓ If the parallel portion of the system is
X%
✓ And the number of processors is Y,
✓ How much speedup can we
expect?
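The question above can be answered with the standard Amdahl's law formula. A small sketch, where `amdahl_speedup` and its parameter names are hypothetical and chosen to match X and Y on this slide:

```ruby
# Expected overall speedup when `parallel_percent` (X) percent of the work
# is spread across `processors` (Y) and the rest stays serial:
#   speedup = 1 / ((1 - fraction) + fraction / Y), where fraction = X / 100
def amdahl_speedup(parallel_percent, processors)
  fraction = parallel_percent / 100.0
  1.0 / ((1.0 - fraction) + fraction / processors)
end

half_parallel_two_cores = amdahl_speedup(50, 2)     # about 1.33
mostly_parallel_8_cores = amdahl_speedup(90, 8)
mostly_parallel_many    = amdahl_speedup(90, 1_000_000)
```

Even with a huge number of processors, a 90% parallel portion can never give more than a 10x speedup, because the serial 10% remains. This is why the non-parallel parts matter so much.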
110. The conclusion so far
✓ We should consider how we can
efficiently balance workloads.
✓ So, we use Task Stealing.
✓ We should eliminate non-parallel
parts
✓ by using a wait-free algorithm.
111. Today's topics
✓ Why do we need Parallel
Marking?
✓ What to consider?
✓ How to implement?
✓ How much did performance
improve?
113. Task Stealing
✓ In Task Stealing, threads steal
tasks from each other
✓ Task Stealing is achieved with
Arora's Deque
114. Arora's Deque
✓ Deque stands for Double-Ended
Queue.
✓ In Arora's Deque, the deque
contains tasks as elements.
✓ It's a wait-free data structure.
122. "Hey wait a minute,
doesn't shift() have
contention problems?"
123. In what ways could shift()
cause contention problems?
e.g...
✓ Multiple threads (workers) may call
shift() on the same deque at the same
time.
124. In what ways could shift()
cause contention problems?
e.g...
✓ shift() and pop() could be called
at the same time
✓ when the deque has only one element.
126. Serialization
✓ shift() is serialized by using CAS.
✓ CAS = Compare And Swap
✓ And, this serialization doesn't use
a lock.
✓ It's wait-free!!
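A rough sketch of how CAS can serialize concurrent shift() calls. This is a simplification, not the full Arora's Deque algorithm and not the code from this talk. `AtomicCell` and `try_steal` are hypothetical names. The sketch assumes nothing beyond Ruby's core classes, so the single CAS step is emulated with a Mutex that stands in for one atomic hardware instruction; a real implementation would use a native CAS and no lock at all.

```ruby
# Emulated atomic cell. The Mutex only stands in for the atomicity of a
# single hardware compare-and-swap instruction.
class AtomicCell
  def initialize(value)
    @value = value
    @guard = Mutex.new
  end

  def get
    @guard.synchronize { @value }
  end

  # Store new_value only if the cell still holds `expected`.
  def compare_and_set(expected, new_value)
    @guard.synchronize do
      if @value == expected
        @value = new_value
        true
      else
        false
      end
    end
  end
end

# One steal attempt from the head of a fixed task array.
# A thief that loses the race gives up immediately instead of waiting.
def try_steal(tasks, top)
  index = top.get
  return :empty if index >= tasks.size
  task = tasks[index]
  top.compare_and_set(index, index + 1) ? task : :lost_race
end

tasks = (0...1000).to_a
top = AtomicCell.new(0)
thieves = Array.new(4) do
  Thread.new do
    mine = []
    loop do
      result = try_steal(tasks, top)
      break if result == :empty
      mine << result unless result == :lost_race
    end
    mine
  end
end
stolen = thieves.flat_map(&:value)
```

Only the thread whose CAS succeeds owns the task, so every task is taken exactly once even when several thieves read the same index.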
127. I omit details of the
implementation of the
serialization.
128. For the sake of this
presentation, let's assume
that Arora's Deque avoids
contention problems.
129. Summary for Arora's Deque
✓ A simple data structure for Task
Stealing.
✓ Each worker has a single deque.
✓ Stealing (shift operation) is
wait-free!
130. How to use Arora's Deque
in Parallel Marking?
141. Summary
✓ Marker uses Arora's Deque as a
marking stack.
✓ A "task" means an object.
✓ The granularity of the task is very fine.
✓ This is a naive implementation.
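The naive scheme summarized above can be sketched as follows. To keep the example deterministic, the workers are simulated in one thread, taking turns (an assumption made for illustration; the real markers are native threads). Plain Arrays stand in for Arora's Deque, and `OBJECT_GRAPH` and `naive_parallel_mark` are hypothetical names.

```ruby
require 'set'

# Toy heap: object id => ids of its children. All data here is made up.
OBJECT_GRAPH = {
  r1: [:a, :b], r2: [],
  a: [:c, :d], b: [:e], c: [], d: [], e: [:a]
}.freeze

def naive_parallel_mark(heap, roots, worker_count)
  marked = Set.new
  deques = Array.new(worker_count) { [] }
  # Hand the roots out to the workers' deques in turn.
  roots.each_with_index { |root, i| deques[i % worker_count].push(root) }
  marks_per_worker = Array.new(worker_count, 0)

  # Round-robin simulation: each worker handles one task per turn.
  until deques.all?(&:empty?)
    deques.each_with_index do |own, id|
      task = own.pop                          # the owner takes from the tail
      if task.nil?
        victim = deques.find { |deque| !deque.empty? }
        task = victim.shift if victim         # a thief takes from the head
      end
      next if task.nil? || !marked.add?(task) # skip already marked objects
      marks_per_worker[id] += 1
      heap.fetch(task).each { |child| own.push(child) }  # one task per object
    end
  end
  [marked, marks_per_worker]
end

marked, marks_per_worker = naive_parallel_mark(OBJECT_GRAPH, [:r1, :r2], 2)
```

Every single object costs at least one push() and one pop() or shift(), which is the fine granularity mentioned above.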
148. Why slow?
✓ pop(), push() and shift() are called
frequently
✓ because the deque holds fine-grained tasks.
✓ Their overhead is too big.
156. Good point & Bad point
✓ Number of calls to Deque's
operations was reduced.
✓ The marking speed of each worker
improved.
✓ However, coarse-grained tasks
decrease parallelism.
159. If an object in B's branch has many child
objects..
160. .. then A can't steal it while B is marking
the large branch.
161. So, the worker needs to
treat large branches as
special cases.
162. Almost all large branches
hold large Array objects
and/or large Hash objects.
163. Treatment for large Array
objects and Hash objects
✓ Each marker has a special deque
to manage them.
✓ A marker divides them into
fixed-size tasks.
✓ e.g. elements 0-9 of the Array,
elements 10-19 of the Array...
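A minimal sketch of the division into fixed-size tasks. The chunk size of 10 only mirrors the example on this slide; the real size is not stated here, and `CHUNK_SIZE` and `array_marking_tasks` are hypothetical names.

```ruby
CHUNK_SIZE = 10  # assumption taken from the 0-9, 10-19 example

# Split a large Array of `length` elements into index ranges.
# Each range is one task that any worker can steal and mark.
def array_marking_tasks(length, chunk_size = CHUNK_SIZE)
  (0...length).step(chunk_size).map do |start|
    last = [start + chunk_size, length].min - 1
    start..last
  end
end

chunks = array_marking_tasks(25)   # => [0..9, 10..19, 20..24]
```

A worker that picks up the range `10..19` marks only those ten elements, so the rest of the Array stays available to other workers.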
164. Treatment for Large Array
and Hash
✓ By doing this, other workers can
steal divided tasks.
✓ This improves parallelism.
165. Summary
✓ The naive implementation was
slow.
✓ The grain of the task was too fine.
✓ Now a "task" means a branch from Roots.
✓ The grain of the task is coarse.
✓ It's faster!!
166. Today's topics
✓ Why do we need Parallel
Marking?
✓ What to consider?
✓ How to implement?
✓ How much did performance
improve?
170. First benchmark program is
✓ make benchmark
✓ This is the benchmark used in
CRuby development
171.
172. Why does this seem so slow?
✓ I think it's affected by Parallel
Marking's preparation.
✓ e.g. creating marking threads,
allocation of deques.
173. Why does this seem so slow?
✓ In most of the benchmarks, there
are few objects to mark.
✓ In this case, the overhead of Parallel
Marking is relatively expensive.
174. Next benchmark program is
✓ make rdoc
✓ make rdoc generates the Ruby
documentation.
✓ This benchmark measures the total
execution time and the GC execution
time of make rdoc.
175. make rdoc
✓ It takes about 80 seconds on my
machine.
✓ In fact, 30% of that time is spent
on GC!!
✓ How much did performance
improve?
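One way to get this kind of number for your own program is Ruby's built-in GC::Profiler. The sketch below is not how the figures on this slide were measured; `allocate_garbage` is a hypothetical stand-in workload, and the timings it prints depend entirely on the machine.

```ruby
# Hypothetical workload that creates a lot of short-lived objects.
def allocate_garbage(count)
  count.times.map { |i| "object #{i}" }.size
end

GC::Profiler.enable
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
allocate_garbage(200_000)
GC.start
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
gc_seconds = GC::Profiler.total_time   # seconds spent in GC while profiling
GC::Profiler.disable

gc_share = elapsed.zero? ? 0.0 : gc_seconds / elapsed
puts format('total %.3fs, GC %.3fs (%.1f%%)', elapsed, gc_seconds, gc_share * 100)
```

A large GC share, like the 30% quoted above, suggests that speeding up the mark phase can noticeably shorten the whole run.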
179. In a many-core environment
✓ I expect we get a large
improvement.
✓ e.g. 8 cores, 16 cores...
✓ But, my machine has just 2 cores.
✓ I can't see it :(
180. Best case for Parallel GC
✓ If there are many objects.
✓ In this case, there are also many mark targets.
✓ If the objects are long-lived.
✓ Server-side application?
182. Demonstration
✓ I want to show the performance
improvement with Parallel GC.
✓ This demonstration is in the style
of a video game.
188. Other characteristics of
SUPER NARIO GC
✓ GC runs at fixed intervals.
✓ A lot of objects are generated to
increase GC's burden.
✓ Burden = Game Level
189. Try to compare Original GC
and Parallel GC
✓ Original GC pause time is long.
✓ This game will be difficult.
✓ Parallel GC pause time is short.
✓ This game will be easy.
199. Windows OS is not supported
✓ The mark workers use pthreads as
native threads.
✓ They also use some gcc built-in
functions.
✓ But, I'll support Windows
eventually.
200. Increased memory usage.
✓ Size of 1 Deque is roughly 32KB.
✓ But generally multi-core machines
have plenty of memory.
✓ So, I think it's OK :P
202. Conclusion
✓ I implemented Parallel Marking
GC
✓ GC was improved!
✓ I'll report to ruby-core soon.
203. Conclusion
✓ But, Parallel Marking has some
problems.
✓ I'll fix these.
204. source code
✓ Parallel Marking GC
✓ <URL:https://github.com/authorNari/ruby/tree/pmark_div_root2>
✓ SUPER NARIO GC
✓ <URL:https://github.com/authorNari/nario/>
205. Acknowledgments
✓ The following people helped me
make this presentation!!
✓ Tor-san!!
✓ matz, shugo, yhara, sada, takaokouji,
other co-workers!!
207. Do you have any
questions?
Please keep your questions
short and simple :)
208. Sorry
✓ It's too difficult for me to
understand/answer the question.
✓ Could you send the question on
Twitter (@nari_en)?