Grow and Shrink - Dynamically Extending the Ruby VM Stack
1. Grow and Shrink -
Dynamically Extending the
Ruby VM Stack
RubyKaigi 2018, Sendai, Japan
June 2nd, 2018
Keita Sugiyama (杉山敬太) and Martin J. Dürst
Aoyama Gakuin University (青山学院大学)
2. Speaker Introduction
• Keita Sugiyama (杉山 敬太)
First-year master's student (M1)
Intelligence and Information Course
Graduate School of Science and Engineering
Aoyama Gakuin University
• Martin J. Dürst
Professor
Department of Integrated Information Technology
College of Science and Engineering
Aoyama Gakuin University
Looking for a summer internship
3. Software Laboratory:
Past Contributions to Ruby
• Character encoding conversion
(String#encode, 2007)
• Unicode normalization
(String#normalize, 2014)
• Unicode upcase/downcase
(String#upcase,…, 2016)
• Update Unicode version
(Unicode 11.0.0 coming soon, see
https://bugs.ruby-lang.org/issues/14802)
4. Motivation: Multithread Ruby
• Concurrency ever more important
• Multi-core, languages such as Go and Elixir
• Efforts to make concurrency easier
in Ruby and MRI:
• Fibers, lazy, Guilds, MVM, thread cache,…
Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in software.
http://www.gotw.ca/publications/concurrency-ddj.htm (viewed 2018/1/30).
Koichi Sasada, Yukihiro Matsumoto. A Proposal for a New Concurrent Execution Model for Ruby 3 (in Japanese).
IPSJ Transactions on Programming (PRO), Vol. 10, No. 3, pp. 16–16, June 16, 2017.
5. Multithreading:
Memory Consumption
• Each Ruby VM thread needs a stack
• Stacks are large, fixed size (1MB)
to avoid stack overflow
• Many threads → many stacks → lots of memory
Dynamic extension of VM stacks
to save memory
6. For Code Geeks
• Implementation is published at:
github.com/sugiyama-k/ruby/tree/chaining
7. Outline
• Motivation
• The Ruby VM stacks
• Two extension methods:
Stretching and Chaining
• Safe and efficient development tips
• Experimental results
• Conclusions
9. MRI
• Matz’s Ruby Interpreter
• Also called CRuby
• Ruby’s reference implementation
• Latest Stable Version: Ruby 2.5.1
• This work based on revision 60436
Yukihiro Matsumoto et al. Ruby, the object-oriented scripting language. https://www.ruby-lang.org/ja/
(viewed 2018/2/26).
10. The Ruby Virtual Machine
• YARV (Yet Another Ruby VM)
• Developed by Koichi Sasada et al.
• Introduced with Ruby 1.9
• Uses two stacks
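Koichi Sasada, Yukihiro Matsumoto, Atsushi Maeda, Mitaro Namiki. Implementation and Evaluation of YARV, a Virtual Machine for Ruby (in Japanese).
IPSJ Transactions on Programming (PRO), Vol. 47, No. SIG2 (PRO28), pp. 57–73,
Feb 2006.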
Dave Thomas, Chad Fowler, and Andy Hunt. Programming Ruby 1.9, Section 25.6.
The Pragmatic Programmers, 2009.
Pat Shaughnessy. Ruby Under a Microscope: An Illustrated Guide to Ruby Internals.
No Starch Press, 2013.
11. Ruby VM Stacks
[Figure: the VM stack area from lower to higher addresses. The internal stack (made of stack frames) and the call stack (made of control frames) share the area; overflow happens when the two stacks meet.]
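The shared layout can be imitated with a toy model. This is only a sketch, not MRI code: the class `TwoEndedStack`, its method names, and the use of one Ruby array slot per stack entry are assumptions made for illustration.

```ruby
# Toy model of one VM stack area shared by two stacks (not MRI code).
# The internal stack grows upwards from index 0; the call stack grows
# downwards from the last index. All names here are illustrative.
class TwoEndedStack
  class Overflow < StandardError; end

  attr_reader :sp, :cfp

  def initialize(size)
    @area = Array.new(size)
    @sp = 0       # next free slot of the internal stack (low end)
    @cfp = size   # lowest used slot of the call stack (high end)
  end

  # Push a value onto the internal stack.
  def push_value(value)
    raise Overflow, "the two stacks met" if @sp >= @cfp
    @area[@sp] = value
    @sp += 1
  end

  # Push a control frame onto the call stack.
  def push_frame(frame)
    raise Overflow, "the two stacks met" if @cfp <= @sp
    @cfp -= 1
    @area[@cfp] = frame
  end

  # Slots still available between the two stacks.
  def free_slots
    @cfp - @sp
  end
end
```

Either stack can use the free space in the middle, so neither needs its own fixed limit; the price is that growing the area means moving at least one of the stacks.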
12. Call Stack
• Array of control frames (6 words/48 bytes)
• One frame per invocation
(eval/class/method/block/cfunc)
14. Object#show_stack
• Show stack even when there is no error in
the Ruby interpreter
• Works like Object#tap without a block
• Available as proof-of-concept patch
(https://bugs.ruby-lang.org/issues/14801)
15. Control Frame Structure
vm_core.h
typedef struct rb_control_frame_struct {
    const VALUE *pc;         /* program counter */
    VALUE *sp;               /* stack pointer */
    const rb_iseq_t *iseq;   /* instruction sequence */
    VALUE self;              /* self */
    const VALUE *ep;         /* environment pointer */
    const void *block_code;  /* instruction structure for the block */
} rb_control_frame_t;
16. Internal Stack
• One stack frame per control frame
• Variable size, overlapping
• Used for execution of instructions
18. Access to Local Variables
[Figure: local variables are part of a frame's environment data and are accessed by index from the environment pointer.]
19. Access to Local Variables
in Outer Scope
[Figure: a block frame and the frame where the block is defined, each with its environment pointer and environment data. The specval of the block frame points to the EP of the outer scope of the block; outer variables are accessed by index from that EP.]
20. Layout of Arguments
Passed to a Method
[Figure: self and Arguments 1 to 3 on the internal stack, spanning the caller frame and the callee frame, with the environment pointer and stack pointer of each frame. The arguments to the method become part of the callee's environment data and are accessed from the callee's environment pointer.]
22. Execution Context Structure
typedef struct rb_execution_context_struct {
    /* information about virtual machine */
    VALUE *vm_stack;          /* vm stack */
    size_t vm_stack_size;     /* stack size */
    rb_control_frame_t *cfp;  /* current control frame */
    /*
     * omitted
     */
} rb_execution_context_t;
23. History of this Research
• Proposed by Koichi Sasada in 2016
• First attempt in 2016/7 by Sho Koike;
the implementation remained incomplete
• Continued in 2017/8
Sho Koike. Proposal and Implementation of Stack Area Extension in the Ruby VM (in Japanese). Graduation thesis,
Aoyama Gakuin University, 2016.
25. Stack Extension
• When a potential stack overflow is detected,
extend the stack to double its size
• To guard against infinite recursion:
• Set a maximum stack size
• If maximum stack size is reached,
a stack overflow is triggered
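The size policy can be sketched as follows. This is a minimal illustration, not the authors' code: the method names are made up, and clamping the doubled size to the maximum is an assumption.

```ruby
# Sketch of the size policy (not MRI code; names are illustrative).
# Double the stack on a potential overflow, but never beyond max_size.
# Raises SystemStackError once the maximum has been reached.
def next_stack_size(current_size, max_size)
  raise SystemStackError, "stack level too deep" if current_size >= max_size
  [current_size * 2, max_size].min   # clamping is an assumption
end

# All sizes a stack passes through if it keeps overflowing.
def growth_steps(initial_size, max_size)
  sizes = [initial_size]
  sizes << next_stack_size(sizes.last, max_size) while sizes.last < max_size
  sizes
end
```

With these assumptions, a stack that starts at 1KB reaches a 1MB maximum after ten doublings, so even a runaway recursion costs only a handful of extensions before the usual stack overflow is raised.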
26. Stretching the Stacks
• Keep structure of overall stack
• Allocate new memory area
• Copy call stack and internal stack
• Free old memory area
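The copy step can be imitated on a Ruby array. This is a toy sketch under stated assumptions, not the implementation: `stretch` and its parameters are hypothetical, the internal stack is assumed to occupy slots `0...sp`, and the call stack slots `cfp...area.size`.

```ruby
# Toy model of stretching (not MRI code; names are illustrative).
# Internal stack: slots 0...sp. Call stack: slots cfp...area.size.
# Returns the new area and the new index of the call stack's lowest slot.
def stretch(area, sp, cfp, new_size)
  raise ArgumentError, "new area is smaller than the old one" if new_size < area.size

  call_stack_len = area.size - cfp
  new_area = Array.new(new_size)

  # The internal stack is copied to the low end of the new area.
  new_area[0, sp] = area[0, sp]

  # The call stack is copied to the high end, keeping the overall structure.
  new_cfp = new_size - call_stack_len
  new_area[new_cfp, call_stack_len] = area[cfp, call_stack_len]

  [new_area, new_cfp]
end
```

In this model the old array simply becomes garbage; in C the old area has to be freed explicitly, and every pointer into it becomes invalid at that moment.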
[Figure: on overflow, a new area is allocated, the call stack and the internal stack are copied into it, and the old area is freed.]
27. VM Stack Size
• Default Size: 1MB
About 10,000 recursions
• Use environment variables from trunk to set
maximum stack size
• RUBY_THREAD_VM_STACK_SIZE
• RUBY_FIBER_VM_STACK_SIZE
• Introduce new variables for initial stack size
• RUBY_THREAD_VM_STACK_INITIAL_SIZE
• RUBY_FIBER_VM_STACK_INITIAL_SIZE
• Reduce minimum size and alignment
28. Triggering Stack Extension
Change overflow check to call of stack
extension function
Before:
#define CHECK_VM_STACK_OVERFLOW0(cfp, sp, margin) \
    if (!(((rb_control_frame_t *)((sp) + (margin)) + 1) \
          >= (cfp))) {(void)0;} \
    else vm_stackoverflow()
After:
#define CHECK_VM_STACK_OVERFLOW0(ec, cfp, sp, margin) \
    if (!(((rb_control_frame_t *)((sp) + (margin)) + 1) \
          >= (cfp))) {(void)0;} \
    else vm_stack_try_extend(ec, cfp, sp, margin)
29. Possible Places for Stack Extension
Functions where
CHECK_VM_STACK_OVERFLOW is called:
• invoke_iseq_block_from_c
• setup_parameters_complex
• vm_caller_setup_arg_splat
• vm_call0_body
• vm_push_frame_
• vm_call_method_missing
• vm_callee_setup_block_arg_arg0_splat
• vm_callee_setup_block_arg
30. Overview of
Stack Extension Processing
1. Smaller than max size?
   No: stack overflow
   Yes: continue
2. Decide new size
3. Stretching the stacks
4. Deal with pointers:
   • stack-internal
   • execution context
31. Dealing with Pointers to Stacks
Because the stacks move, pointers into the
stacks have to be fixed or changed.
• Stack-internal pointers,
pointers from execution context:
Adjust to point to new location
• ‘Unknown’ pointers:
Change referencing method
• Arguments to C functions:
Copy to C stack
32. Change of Referencing Method
• Before: Direct pointers
• After: Offsets from start of stack
• Conversion from offsets to pointers on access
[Figure: direct pointers are replaced by an offset into the call stack and an offset into the internal stack.]
33. Conversion between
Pointers and Offsets
static inline ptrdiff_t
vm_stack_ptr_save(const rb_execution_context_t *ec,
                  const VALUE *ptr)
{
    return ptr - ec->vm_stack;
}
static inline VALUE *
vm_stack_ptr_restore(const rb_execution_context_t *ec,
                     ptrdiff_t saved_ptr)
{
    return ec->vm_stack + saved_ptr;
}
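Why offsets survive a move while direct pointers do not can be shown with a toy model. This is a sketch, not MRI code: `MEMORY` stands in for the address space, an "address" is an index into it, and `ExecutionContext` and `move_stack` are hypothetical; only the two conversion functions mirror the C functions of the same name.

```ruby
# Toy model (not MRI code). MEMORY plays the role of the address space,
# and ExecutionContext keeps only the base address of the VM stack.
MEMORY = Array.new(64)
ExecutionContext = Struct.new(:vm_stack)

# Pointer to offset, as in the C function of the same name.
def vm_stack_ptr_save(ec, ptr)
  ptr - ec.vm_stack
end

# Offset back to pointer, using the current base address.
def vm_stack_ptr_restore(ec, saved_ptr)
  ec.vm_stack + saved_ptr
end

# Hypothetical helper: move `size` slots to a new base address,
# as a stack extension would, and wipe the old area.
def move_stack(ec, size, new_base)
  MEMORY[new_base, size] = MEMORY[ec.vm_stack, size]
  MEMORY[ec.vm_stack, size] = Array.new(size)
  ec.vm_stack = new_base
end

ec = ExecutionContext.new(8)
MEMORY[8 + 3] = :local_variable
direct = 8 + 3                          # direct pointer into the stack
saved  = vm_stack_ptr_save(ec, direct)  # offset from the start of the stack

move_stack(ec, 8, 32)

stale_read = MEMORY[direct]                           # old location, now empty
fresh_read = MEMORY[vm_stack_ptr_restore(ec, saved)]  # follows the stack
```

After the move, the direct pointer reads the wiped old area while the restored offset still finds the value. The conversion on every access is what the later slides measure as overhead.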
34. Arguments to C Functions
• Problem:
Passing arguments on Ruby internal stack
to C Functions that do not expect them to move
• Solution:
Copy arguments to separate location
/* (in function vm_call_cfunc_with_frame) */
argv = ALLOCA_N(VALUE, argc);
MEMCPY(argv, reg_cfp->sp + 1, VALUE, argc);
val = (*cfunc->invoker)(cfunc->func, recv, argc, argv);
36. A Simple Development Cycle
• Fix something
• Run tests
• Wait for a segmentation fault
• Have no idea why it occurred,
or how to reproduce it
37. Problems and Solutions
1. Segmentation faults are too late
→ Prohibit access to old stacks
2. Stack extensions occur too rarely
→ Frequent stack movement
Limited to development:
• Linux only
• #if VM_STACK_USE_MPROTECT
38. Prohibit Access to Old Stacks
• Instead of freeing, prohibit access to stack
• Use mprotect function
• Linux system call
• Controls access to memory pages
• Produces segmentation fault immediately
• Quickly discovers stack access that needs fixing
Michael Kerrisk. The Linux Programming Interface (Japanese edition, translated by Jiro Senju).
O'Reilly Japan, 2012.
39. Testing Stack Extension
• Problem:
Testing stack extension from a Ruby
program is difficult
• Stack extension occurs rarely
• Cannot trigger stack extension at specific point
in program execution
Explicit frequent triggering of stack
extension by implementation
40. Frequent Stack Movement
• Stack overflow is checked at 8 locations in MRI
• Move stack at every overflow check
(extend only if necessary)
• Makes it possible to check that there are no locations
where stack extension may lead to bugs
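The two development aids, prohibiting access to old stacks and moving the stack at every overflow check, can be imitated in a toy model. This is a sketch only: `StackArea`, `StaleAccess`, and `check_overflow` are made-up names, and a `retired` flag loosely stands in for what `mprotect` does to real memory pages.

```ruby
# Toy model of the debugging aids (not MRI code; names are illustrative).
# A retired area raises on any access, so stale references fail at once.
class StackArea
  class StaleAccess < StandardError; end

  def initialize(size)
    @slots = Array.new(size)
    @retired = false
  end

  def [](index)
    check!
    @slots[index]
  end

  def []=(index, value)
    check!
    @slots[index] = value
  end

  # Copy the contents into a fresh area and prohibit access to this one.
  def move
    fresh = StackArea.new(@slots.size)
    @slots.each_with_index { |value, index| fresh[index] = value }
    @retired = true
    fresh
  end

  private

  def check!
    raise StaleAccess, "access to an old stack area" if @retired
  end
end

# Overflow check that, in development mode, moves the stack every time.
def check_overflow(area, always_move: true)
  always_move ? area.move : area
end
```

Because every check moves the area, code that keeps a reference across a check fails on its very next access instead of at some rare, hard-to-reproduce extension.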
41. Functional Test Results
• Version based on r60436 passes
make test-all
(17,429 tests, 2,232,108 assertions)
• Development version passes most tests,
except:
• Time-limited tests: Okay if limits increased
• One resource-limited test
• Memory leak tests: Fail because the development version does not free old stacks
Fully functional implementation
43. Execution Speed
• Using Ruby benchmarks
• Measure basic execution speed
• Influence of offset↔pointer conversions,…
• Stacks at default size, no extensions
44. Execution Environment
Each experiment is run 3 times;
the best time is used
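A best-of-three measurement can be sketched as below. How the authors actually timed the benchmarks is not stated, so in-process timing with a monotonic clock is an assumption here, and `best_time` and `relative_time` are hypothetical helper names.

```ruby
# Sketch of a best-of-N timing helper (assumed method, not the authors').
# Runs the block `runs` times and returns the shortest time in seconds.
def best_time(runs = 3)
  times = Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
  times.min
end

# Relative execution time: modified interpreter divided by trunk.
def relative_time(modified_seconds, trunk_seconds)
  modified_seconds / trunk_seconds
end
```

Taking the minimum rather than the mean reduces the influence of unrelated system activity, which can only make a run slower.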
CPU           Intel x86_64 CPU Core i7-6500U 2.50GHz
Memory        16GB
OS            Gentoo Linux 4.12.5
Compiler      GCC (Gentoo 6.4.0 p1.1)
Ruby version  ruby 2.5.0 dev (r60436)
45. Change of Execution Time
(relative)
[Figure: relative execution time per benchmark (axis from 0 to 2), Stretching compared with trunk.]
Relative execution time of Stretching: max 1.628, average 1.185
46. Reasons for Lower Speed
• Indirect access to call stack
• Frequent access to call stack
• VM instruction execution
• Method calls
• Block-related processing
Speedup can be expected
if control frames can stay in place
47. New Extension Method:
Chaining
Extension Method  Call Stack               Internal Stack
Stretching        Grows downwards          Grows upwards
Chaining          Chain of control frames  Grows upwards
48. Call Stack Chaining
• Call stack is implemented as a chain (linked list) of control
frames
• Stack overflow only happens for internal stack
• Internal stack is moved as before
• Based on Lua’s implementation
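A chained call stack can be modelled as a singly linked list. This is a toy sketch, not the implementation in the published branch: `ControlFrame` here has only a name and a link, and `ChainedCallStack` is a made-up class.

```ruby
# Toy model of a chained call stack (not MRI code; names are illustrative).
# Each control frame is allocated on its own and links to its caller's frame.
ControlFrame = Struct.new(:name, :prev)

class ChainedCallStack
  attr_reader :top, :depth

  def initialize
    @top = nil
    @depth = 0
  end

  # Allocate a new frame and link it to the current top.
  def push(name)
    @top = ControlFrame.new(name, @top)
    @depth += 1
    @top
  end

  # Unlink and return the innermost frame.
  def pop
    raise "call stack is empty" if @top.nil?
    frame = @top
    @top = frame.prev
    @depth -= 1
    frame
  end

  # Frame names from the innermost call outwards.
  def backtrace
    names = []
    frame = @top
    while frame
      names << frame.name
      frame = frame.prev
    end
    names
  end
end
```

Frames never move once allocated, so references to them stay valid when the internal stack is extended; the cost is one allocation per call.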
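[Figure: with chaining, only the internal stack is moved on overflow: a new area is allocated, the internal stack is copied, and the old area is freed.]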
R. Ierusalimschy, L. H. de Figueiredo, and W. Celes. The Evolution of Lua. In Proceedings of the Third ACM SIGPLAN
Conference on History of Programming Languages (HOPL III), pp. 2-1–2-26, New York, NY, USA, 2007. ACM.
50. Why is Chaining Still Slower?
• Moving of internal stack
• Access via offsets
• Copying arguments to C functions
• Dynamic allocation of control frames
Benchmarks with deep recursion tend to be
slower
51. More Evaluation Experiments:
Overview
• Change of execution time
• Influence of initial stack size
• Reduction of memory use
• Sleeping threads
• Influence of thread count
• Influence of initial stack size
• Recursive function invocation
to control thread depth
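Holding a thread at a chosen recursion depth can be sketched as below. This is an assumed setup, not the authors' benchmark script: `at_depth` and `spawn_threads_at_depths` are hypothetical names, and a `Queue` is used instead of `sleep` so that the threads can be released on demand.

```ruby
# Recurse `depth` times, then run the block at the deepest point, so the
# frames stay on the VM stack for as long as the block runs.
def at_depth(depth, &block)
  return block.call if depth.zero?
  result = at_depth(depth - 1, &block)
  result  # keep the recursive call out of tail position
end

# Start one thread per entry in `depths`. Each thread recurses to its depth
# and then waits on `hold` (a Queue) until a token is pushed.
# The thread's value is the number of frames it observed while waiting.
def spawn_threads_at_depths(depths, hold)
  depths.map do |depth|
    Thread.new do
      at_depth(depth) do
        observed = caller.size
        hold.pop
        observed
      end
    end
  end
end
```

While the threads wait, the peak resident set size of the process can be measured from outside, for example with /usr/bin/time.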
52. Change of execution time
when changing initial stack size
• Influence of shrinking initial stack size on
execution time
• Initial stack size
256 Bytes to 1MB, multiplying by 4
53. Relationship between Initial
Stack Size and Execution Time
(for Chaining)
[Figure: relative execution time (axis from 0 to 1.4) for initial stack sizes of 256B, 1KB, 4KB, 16KB, 64KB, 256KB, and 1MB, compared with trunk.]
Initial stack size has almost no influence
on execution time
54. Evaluation of Memory Usage
• Change of memory usage due to stack
extension
• Influence of thread count
• Influence of initial stack size
• Initial stack size
• Baseline: 1MB (default)
• Our implementation: 1KB
55. Method of Evaluating
Memory Usage
• Using /usr/bin/time command
• Measuring maximum resident set size
• Generating many sleeping threads
# Start the requested number of threads; each one just sleeps.
(0..ARGV[0].to_i).map do |i|
  Thread.new { sleep 100 }
end
56. Decrease in Memory Usage
                               baseline  Stretching  Chaining
Increase of memory per thread  25.7KB    14.8KB      15.0KB
[Figure: maximum memory usage [MB] (axis from 0 to 300) against number of threads, for Stretching, Chaining, and trunk.]
40 % reduction
57. Changes of Memory Usage
Depending on Initial Stack Size
• Initial stack size:
Varying from 128B to 1MB, doubling
• Number of threads: 10,000
58. Relationship between Initial
Stack Size and Memory Usage
[Figure: maximum memory usage [MB] (axis from 0 to 300) against initial stack size [B], for Stretching, Chaining, and trunk.]
The smaller the initial stack size,
the greater the memory savings
59. Reasons for only Slightly
Lower Memory Usage
• Calculated memory usage:
1 MB × 10,000 = 10 GB
• Actual memory usage:
Only about 250 MB
• Large memory allocations just reserve virtual
memory
• Mapped to real memory only when actually used
The Linux Kernel Organization, Inc. overcommit-accounting, 4.14
edition, January 23, 2018.
60. Memory Usage
with Actual Stack Extension
• 1024 threads
• Two models for stack depth distribution:
• Linear model
• Exponential model
61. Linear Model
• Thread 𝑛 uses 10𝑛 recursions
• Average number of recursions: 5125
recursions  stack extensions  threads
10          1                 1
20          2                 1
30          2                 1
40          3                 1
…
10,240      10                1
62. Exponential Model
• When recursion depth doubles,
the number of threads is halved
• Average number of recursions: 60
• More realistic than linear model
recursions  stack extensions  threads
10          0                 512
20          1                 256
40          2                 128
80          3                 64
…
10,240      10                1
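The average recursion depths quoted for the two models can be checked with a short calculation. The rows hidden behind "…" are not given on the slides, so the exponential distribution below is a reconstruction: it assumes the thread count halves from 512 at depth 10 down to 1 at depth 5,120, with one final thread at depth 10,240.

```ruby
THREADS = 1024

# Linear model: thread n (1..1024) recurses 10 * n times.
def linear_depths(threads = THREADS)
  (1..threads).map { |n| 10 * n }
end

# Exponential model (reconstructed): 512 threads at depth 10, 256 at 20,
# and so on, plus one last thread at the next depth to reach 1,024 threads.
def exponential_depths(threads = THREADS)
  depths = []
  count = threads / 2
  depth = 10
  while count >= 1
    depths.concat([depth] * count)
    count /= 2
    depth *= 2
  end
  depths << depth
end

def average(values)
  values.sum.to_f / values.size
end

linear_average = average(linear_depths)            # 5125.0
exponential_average = average(exponential_depths)  # 60.0
```

Both results match the averages on the slides, which suggests the reconstruction of the hidden rows is at least consistent with the stated figures.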
63. Memory Usage with Actual
Stack Extension: Results
• Linear model: Memory increase, largest with Chaining
• Exponential model: Memory decrease for both
Stretching and Chaining
Maximum memory usage [MB]
Model        Stretching  Chaining  trunk
Linear       375.5       550.4     348.0
Exponential  24.3        24.6      34.9
Reduction of memory usage
for a realistic model
64. Experiments: Caution
Ruby memory is abstracted by four layers
1. Ruby VM (we are here)
2. Allocator (glibc malloc, jemalloc,…)
3. Operating System (virtual memory,…)
4. Hardware (cache,
translation lookaside buffer,…)
Nate Berkopec, Malloc Can Double Multi-threaded Ruby Program Memory Usage,
https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html
65. Dynamic Stack Extension
in other Programming Languages
• Go, mruby, Lua, Perl
Dynamic stack extension
• Java
VM specification assumes extensible stack
• Python
Separate frame for each invocation
66. Summary
• Stable implementation of dynamic
Ruby VM stack extension
• Memory needs for highly multithreaded
programs reduced
• Chaining decreases speed by only 6.5%
67. What’s Next
• Further improve speed
Try to avoid slowdown for programs without threads
• Testing on many environments
• FreeBSD,…, Windows,…
• Avoid Linux-specific code
• Community involvement
• Your opinion matters
• Your data matters even more
68. Acknowledgements
• Koichi Sasada (笹田 耕一)
Idea and lots of advice
• Sho Koike (小池 翔)
First implementation attempt, mistakes we were able to learn from
• Shunichi Matsubara (松原 俊一), Yoshiyuki Shoji (莊司 慶行)
Help with research and talk preparation
• Yusuke Endo (遠藤 侑介)
Interesting discussions
• Anonymous reviewer(s)
Advice on experiments
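The two kinds of stacks are placed in the stack area of the Ruby virtual machine as shown in this figure.
Following the book Ruby Under a Microscope, this talk calls
the stack that grows from lower toward higher memory addresses the internal stack,
and the stack that grows in the opposite direction the call stack.
A stack overflow of the virtual machine occurs when the two stacks collide.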