Optcarrot: A Pure-Ruby NES Emulator
http://regional.rubykaigi.org/tokyo11/
https://github.com/mame/optcarrot


  1. • A NES Emulator written in Ruby
     • Demo
  2. • To drive “Ruby3x3”
       – Matz said “Ruby 3 will be 3 times faster than Ruby 2.0”
       – Optcarrot is a CPU-intensive, real-life benchmark
     • Currently works at 20 fps in Ruby 2.0 → 60 fps in 3.0!
     • A carrot to let horses (Ruby committers) optimize Ruby
     • To challenge Ruby’s limit
       – NES video resolution: 256 x 240 pixels / 60 fps
       – This loop alone takes 0.2 sec.:
           (256*240*60).times do |i|
             ary[0] = 0
           end
       – We need to do all other tasks in 0.8 sec.? Impossible?
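The time budget on the slide above can be checked with a small script. This is only a sketch: the constant names and the use of `Benchmark.realtime` are my own choices, and the elapsed time depends on the machine and the Ruby version, so it will not necessarily match the 0.2 sec. quoted on the slide.

```ruby
require "benchmark"

# NES video output: 256 x 240 pixels at 60 frames per second.
WIDTH  = 256
HEIGHT = 240
FPS    = 60
PIXELS_PER_SECOND = WIDTH * HEIGHT * FPS

ary = [0]

# Time the cheapest imaginable per-pixel work: one array store per pixel.
elapsed = Benchmark.realtime do
  PIXELS_PER_SECOND.times do |i|
    ary[0] = 0
  end
end

puts format("%d array writes took %.3f sec.", PIXELS_PER_SECOND, elapsed)
puts format("Remaining budget for one emulated second: %.3f sec.", 1.0 - elapsed)
```

Whatever remains of the one-second budget is what the emulator has left for everything else it must do in one emulated second.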
  3. • Famicom programming with Ruby (takkaw, 2007)
       – Presentation NES ROM by Ruby
     • MRI's incremental GC (authornari, 2008)
       – Mario-like game "Nario" is used to demonstrate the real-time GC
     • Burn (remore, 2014)
       – A framework to create NES ROM in Ruby
  4. • NES architecture in three minutes
     • How I achieved 20 fps
     • Ruby interpreters’ benchmark
     • Towards 60 fps
     • Speaker's award & Conclusion
  5. • The details of NES architecture
       – In short: “See http://wiki.nesdev.com/ !”
     • How to find the bottleneck
       – In short: “Use stackprof!”
     • I’ll talk about these topics at “Kawasaki Ruby Kaigi 01” (2016/08/20)
  6. • NES Architecture in three minutes (current section)
     • How I achieved 20 fps
     • Ruby interpreters’ benchmark
     • Towards 60 fps
     • Speaker's award & Conclusion
  7. • [Diagram: the NES contains a CPU, a GPU, RAM (2 kB) and VRAM (2 kB); the cartridge contains a Program ROM and a Bitmap ROM. The arrows between them are labeled control, interrupt, read, read/write and render.]
     • To be precise: the GPU is called the “PPU” (Picture Processing Unit) in the NES
  8. • Execution time ratio: GPU 80%, CPU 10%, others 10%
     • Why does GPU emulation take so much time?
       – The GPU runs at a higher clock speed than the CPU
         • GPU: 5.3 MHz
         • CPU: 1.8 MHz
       – The GPU does many complex tasks
         • Background rendering
         • Sprite rendering
         • Scrolling
         • Conflict detection
         • Interrupts
  9. • Per-pixel tasks (i.e. 256 x 240 x 60 = 3.7M times per second)
       1. Identify what bitmap is shown here
       2. Read attribute data (color, flip flag, z-index)
       3. Read bitmap data from the ROM
       4. Assemble them into video signal
     • [Diagram: steps 1 to 4 drawn as arrows between the GPU, the background map and attribute map in VRAM, the Bitmap ROM in the cartridge, and the target pixel]
     • To be precise: these tasks are actually done per eight pixels
  10. • Terribly complex
        – http://wiki.nesdev.com/w/index.php/File:Ntsc_timing.png
  11. • NES Architecture in three minutes
      • How I achieved 20 fps (current section)
        – How to emulate CPU-GPU parallelism
        – How to optimize GPU emulation
      • Ruby interpreters’ benchmark
      • Towards 60 fps
      • Speaker's award & Conclusion
  12. • Naïve approach: emulate CPU & GPU per clock
        1. Run the CPU for one clock
        2. Run the GPU for three clocks
        3. Repeat 1 and 2
        – Simple and accurate
        – Very slow (~ 3 fps) because of too many method calls
      • [Timeline diagram: along the clock axis, the CPU and the GPU each advance in many small “step”s]
  13. • “Catch-up” method: emulate CPU & GPU per control
        1. Run the CPU until it tries to control the GPU
        2. Run the GPU until it catches up with the CPU
        3. Repeat 1 and 2
        – Accurate and fast (~ 10 fps)
      • [Timeline diagram: along the clock axis, the CPU “run”s until it attempts to control the GPU, then the GPU does a “catchup”, and so on]
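The two scheduling strategies on the previous two slides can be compared on a toy model. Everything here is hypothetical and is not Optcarrot's actual code: `ToyGPU`, `CONTROL_INTERVAL`, `controls_gpu?` and the "CPU" that merely counts clocks are stand-ins, and I assume the CPU touches the GPU once every 100 clocks. The sketch only shows that both strategies leave the GPU in the same state while the catch-up method switches between the two units far less often.

```ruby
GPU_CLOCKS_PER_CPU_CLOCK = 3
CONTROL_INTERVAL = 100 # assumption: the toy CPU controls the GPU every 100 clocks

# Hypothetical GPU: it only counts clocks and records control writes.
ToyGPU = Struct.new(:clock, :writes) do
  def step
    self.clock += 1
  end

  # Run until the GPU clock reaches the given target.
  def catch_up(target)
    step while clock < target
  end

  def control(value)
    writes << [clock, value]
  end
end

def controls_gpu?(cpu_clock)
  (cpu_clock % CONTROL_INTERVAL).zero?
end

# Naive approach: alternate between CPU and GPU on every CPU clock.
def run_per_clock(cpu_clocks)
  gpu = ToyGPU.new(0, [])
  switches = 0
  1.upto(cpu_clocks) do |cpu_clock|
    gpu.control(cpu_clock) if controls_gpu?(cpu_clock) # 1. CPU: one clock
    GPU_CLOCKS_PER_CPU_CLOCK.times { gpu.step }        # 2. GPU: three clocks
    switches += 1
  end
  [gpu, switches]
end

# Catch-up method: let the CPU run ahead and sync only when it touches the GPU.
def run_catch_up(cpu_clocks)
  gpu = ToyGPU.new(0, [])
  switches = 0
  1.upto(cpu_clocks) do |cpu_clock|
    next unless controls_gpu?(cpu_clock)
    gpu.catch_up((cpu_clock - 1) * GPU_CLOCKS_PER_CPU_CLOCK)
    gpu.control(cpu_clock)
    switches += 1
  end
  gpu.catch_up(cpu_clocks * GPU_CLOCKS_PER_CPU_CLOCK) # final sync
  [gpu, switches + 1]
end

naive_gpu, naive_switches = run_per_clock(1000)
fast_gpu, fast_switches = run_catch_up(1000)
puts "same GPU state: #{naive_gpu == fast_gpu}"
puts "switches: #{naive_switches} (per clock) vs #{fast_switches} (catch-up)"
```

In a real emulator the CPU cannot know in advance when it will touch the GPU, so the sync is triggered from inside the register access itself; the toy model sidesteps that by using a fixed interval.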
  14. • Naïve approach: per-pixel emulation
        – Just like the actual hardware
      • [Diagram: steps 1 to 4 between the GPU, the background map and attribute map in VRAM, and the Bitmap ROM in the cartridge]
      • This calculation is done for each iteration → Slow!
  15. • Pre-render the screen and update it on demand
        – When VRAM is modified by the CPU, only the invalidated pixels are updated
        – The screen buffer is transported to the TV once per frame
      • [Diagram: the GPU renders from the background map and attribute map in VRAM and the Bitmap ROM in the cartridge into a screen buffer]
      • This explanation is exaggerated! Actually, the GPU emulation loop is not removed completely.
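The pre-rendering idea can be sketched as a screen buffer with invalidation. This is a hypothetical illustration, not Optcarrot's implementation: `ToyScreen`, `write_vram`, `frame` and `render_tile` are made-up names, and for simplicity it invalidates whole 8x8 tiles (256 x 240 pixels make 32 x 30 of them) rather than individual pixels.

```ruby
# Hypothetical screen model: one VRAM entry per 8x8 tile.
class ToyScreen
  COLUMNS = 32
  ROWS = 30
  TILES = COLUMNS * ROWS

  attr_reader :buffer, :rendered_tiles

  def initialize
    @vram = Array.new(TILES, 0)
    @buffer = Array.new(TILES) { render_tile(0) } # pre-rendered screen
    @dirty = []
    @rendered_tiles = 0
  end

  # CPU-side write: remember which tile became stale instead of redrawing now.
  def write_vram(index, value)
    return if @vram[index] == value
    @vram[index] = value
    @dirty << index
  end

  # Once per frame: redraw only the invalidated tiles, then hand out the buffer.
  def frame
    @dirty.uniq.each do |index|
      @buffer[index] = render_tile(@vram[index])
      @rendered_tiles += 1
    end
    @dirty.clear
    @buffer
  end

  private

  # Stand-in for the expensive per-pixel work: 64 pixels per tile.
  def render_tile(value)
    Array.new(64, value)
  end
end

screen = ToyScreen.new
screen.write_vram(5, 3)
screen.write_vram(5, 4) # same tile again: still only one redraw
screen.write_vram(7, 1)
screen.frame
puts "tiles redrawn: #{screen.rendered_tiles} of #{ToyScreen::TILES}"
```

The benefit comes from frames in which the CPU changes little or nothing: the per-frame cost then scales with the number of changes rather than with the number of pixels.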
  16. • Intel® Core™ i7-4500U @ 2.40 GHz
      • Ubuntu 16.04
  17. • NES Architecture in three minutes
      • How I achieved 20 fps
      • Ruby interpreters’ benchmark (current section)
      • Towards 60 fps
      • Speaker's award & Conclusion
  18. • Is not so big: <5000 lines of code
        – cf. redmine: >30000 LOC
      • Requires no library (in no-GUI mode)
        – It works on miniruby
        – ruby-ffi is used for GUI (SDL2)
      • Uses only basic Ruby features
        – It works on ruby 1.8 / mruby / topaz / opal (with shim and/or systematic modification of source code)
  19. • Benchmark results (fps; chart axis 0.0 to 40.0):
          trunk        28.7
          ruby23       28.1
          ruby22       25.5
          ruby21       26.6
          ruby20       25.0
          ruby193      21.4
          ruby187      5.83
          omrpreview   21.9
          jruby9k      39.2
          jruby17      25.0
          rubinius     4.10
          mruby        7.48
          topaz        27.0
          opal         0.0287
      • MRI has been improved (1.8 → 1.9 → 2.0 → 2.3)
      • OMR preview isn’t fast? (MRI 2.2 w/ JIT)
      • JRuby9k is the fastest
      • ruby 2.0 achieves >20 fps (important for Ruby3x3)
      • Optcarrot works on subset Ruby impls.
  20. • JRuby 9k is the fastest: “Deoptimization” looks like a promising approach
        – At first, optimized byte-code is generated, ignoring rare/pathological cases
        – When needed, it is discarded and naïve byte-code is regenerated
        – BTW: JRuby's boot time is too long
      • OMR is not so fast?
        – JIT has no advantage?
          • Method calls and built-in methods may still be the bottleneck
        – OMR seems not to support opt_case_dispatch yet
          • i.e., a case statement is not optimized well?
  21. • NES Architecture in three minutes
      • How I achieved 20 fps
      • Ruby interpreters’ benchmark
      • Towards 60 fps (current section)
      • Speaker's award & Conclusion
  22. • We have kept the code reasonably clean so far
      • Now, we use any means to achieve the speed
      • CAUTION: Casual Ruby programmers MUST NOT use the following ProTips™
        – This is an experiment to study how to improve Ruby implementation
  23. • Method call is slow
        – Replace it with its method definition
      • Before:
          while catchup?
            inc_addr
          end
      • After:
          while catchup?
            @addr += 1
          end
      • 28 fps → 40 fps
  24. • Instance variable access is slow
        – Replace it with local variable
        – Note: the variable must not be used outside this method
      • Before:
          while catchup?
            @addr += 1
          end
      • After:
          begin
            addr = @addr
            while catchup?
              addr += 1
            end
          ensure
            @addr = addr
          end
      • 40 fps → 47 fps
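The first two tips can be tried on a runnable toy. This is a sketch under assumptions: the `Counter` class and its `@limit` loop condition are hypothetical stand-ins for the `catchup?` loop on the slides, and the measured times depend on the Ruby implementation and version, so the speedups quoted above may not reproduce.

```ruby
require "benchmark"

# Hypothetical class with three versions of the same hot loop.
class Counter
  attr_reader :addr

  def initialize(limit)
    @addr = 0
    @limit = limit
  end

  def inc_addr
    @addr += 1
  end

  # Baseline: one method call per iteration.
  def run_with_method_call
    inc_addr while @addr < @limit
  end

  # Tip 1: the method body is pasted in place of the call.
  def run_inlined
    @addr += 1 while @addr < @limit
  end

  # Tip 2: instance variables are copied into locals for the loop.
  # The ensure clause writes the result back even if the loop raises.
  def run_localized
    begin
      addr = @addr
      limit = @limit
      addr += 1 while addr < limit
    ensure
      @addr = addr
    end
  end
end

N = 1_000_000
%i[run_with_method_call run_inlined run_localized].each do |variant|
  counter = Counter.new(N)
  seconds = Benchmark.realtime { counter.public_send(variant) }
  puts format("%-22s %.3f sec. (addr = %d)", variant, seconds, counter.addr)
end
```

All three variants must end with the same `addr`; only the time per iteration should differ.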
  25. • Batch multiple frequent actions across some clocks
      • Before:
          while catchup?
            case @clock
            when 1 then do_A
            when 2 then do_B
            when 3 then do_C
            ...
            end
            @clock += 1
          end
      • After:
          while catchup?
            if can_be_fast?
              # fast-path
              do_A
              do_B
              do_C
              @clock += 3
            else
              case @clock
              when 1 then do_A
              when 2 then do_B
              when 3 then do_C
              ...
              end
              @clock += 1
            end
          end
      • 47 fps → 63 fps
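The fast path is only safe if it performs exactly the same actions as the clock-by-clock loop. The toy below checks that on a hypothetical three-clock cycle; `ToyUnit`, its `can_be_fast?` condition (at the start of a cycle, with at least three clocks left before the target) and the `do_a` / `do_b` / `do_c` actions are assumptions made for illustration, not Optcarrot's code.

```ruby
# Hypothetical unit that repeats actions A, B, C on consecutive clocks.
class ToyUnit
  attr_reader :clock, :log

  def initialize(target)
    @clock = 0
    @target = target
    @log = []
  end

  def catchup?
    @clock < @target
  end

  # Batching is allowed only when a whole A-B-C cycle fits before the target.
  def can_be_fast?
    (@clock % 3).zero? && @clock + 3 <= @target
  end

  def do_a; @log << :a; end
  def do_b; @log << :b; end
  def do_c; @log << :c; end

  # One dispatch per clock.
  def run_slow
    while catchup?
      case @clock % 3
      when 0 then do_a
      when 1 then do_b
      when 2 then do_c
      end
      @clock += 1
    end
  end

  # Three clocks per iteration whenever possible, one otherwise.
  def run_fast
    while catchup?
      if can_be_fast?
        do_a
        do_b
        do_c
        @clock += 3
      else
        case @clock % 3
        when 0 then do_a
        when 1 then do_b
        when 2 then do_c
        end
        @clock += 1
      end
    end
  end
end

slow = ToyUnit.new(10).tap(&:run_slow)
fast = ToyUnit.new(10).tap(&:run_fast)
puts "same actions: #{slow.log == fast.log}, final clock: #{fast.clock}"
```

The slow path stays in place for the clocks where the batch would overshoot the target, which is why the `else` branch still contains the full `case`.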
  26. • Optimization results (fps; chart axis 0.0 to 80.0):
          base                29.4
          method inlining     40.3   (ProTip™ 1)
          ivar localization   46.6   (ProTip™ 2)
          fastpath            62.7   (ProTip™ 3)
          misc                68.8
          CPU misc            83.2
  27. • Used Regexp to systematically rewrite the code
        – instead of hand-rewriting
          src = File.read(__FILE__)
          src.gsub!(/.../) { ... } # method inlining
          src.gsub!(/.../) { ... } # ivar localization
          eval(src)
      • Used Welch’s t-test to confirm each optimization
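Both ideas on this slide fit in a short sketch. It is an illustration under assumptions, not the rewriting that Optcarrot performs: the `Walker` source, the regular expression and the helper names (`mean`, `sample_variance`, `welch_t`, `time_runs`) are all made up here. The statistic is the standard Welch form, t = (m1 - m2) / sqrt(s1²/n1 + s2²/n2), with the Welch-Satterthwaite degrees of freedom; turning t into a p-value needs a t-distribution table or a statistics library, which is left out.

```ruby
SOURCE = <<~RUBY
  class Walker
    attr_reader :addr

    def initialize(limit)
      @addr = 0
      @limit = limit
    end

    def inc_addr
      @addr += 1
    end

    def run
      while @addr < @limit
        inc_addr
      end
    end
  end
RUBY

# Method inlining by regexp: a line consisting only of the call is replaced
# with the method body. The rewritten class gets a new name so both exist.
optimized = SOURCE.gsub(/^([ \t]*)inc_addr$/) { "#{$1}@addr += 1" }
                  .sub("class Walker", "class FastWalker")
eval(SOURCE)
eval(optimized)

def mean(xs)
  xs.sum.to_f / xs.size
end

def sample_variance(xs)
  m = mean(xs)
  xs.sum { |x| (x - m)**2 } / (xs.size - 1)
end

# Welch's t statistic and degrees of freedom for two independent samples.
def welch_t(a, b)
  va = sample_variance(a) / a.size
  vb = sample_variance(b) / b.size
  t = (mean(a) - mean(b)) / Math.sqrt(va + vb)
  df = (va + vb)**2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
  [t, df]
end

# Collect several timings per variant instead of trusting a single run.
def time_runs(klass, runs: 5, limit: 200_000)
  Array.new(runs) do
    walker = klass.new(limit)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    walker.run
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

t, df = welch_t(time_runs(Walker), time_runs(FastWalker))
puts format("Welch's t = %.2f with %.1f degrees of freedom", t, df)
```

A large positive t here means the original class took longer per run than the rewritten one by more than the run-to-run noise would explain.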
  29. • Benchmark results, default mode and optimized mode (fps; chart axis 0.0 to 90.0)
      • Default mode:
          trunk        28.6
          ruby23       28.0
          ruby22       25.2
          ruby21       26.9
          ruby20       26.1
          ruby193      21.4
          ruby187      5.87
          omrpreview   22.8
          jruby9k      39.3
          jruby17      25.3
          rubinius     3.97
          mruby        7.02
          topaz        29.3
          opal         0.0285
      • Optimized mode, values in chart order (13 values for the 14 implementations): 84.0, 82.9, 78.2, 79.6, 68.1, 64.0, 1.46, 69.0, 2.12, 6.13, 2.43, 0.754, 0.0501
      • The generated program is too large to fit the JVM 64k bytecode limit
  30. • NES Architecture in three minutes
      • How I achieved 20 fps
      • Ruby interpreters’ benchmark
      • Towards 60 fps
      • Speaker's award & Conclusion (current section)
  31. • The first person who improved MRI performance by using Optcarrot
        – Instance variable access has been improved by about 10% [Bug #12274]
      • Optcarrot has already started to improve Ruby!
  32. • Optcarrot, a pure-Ruby NES emulator
        – Non-trivial benchmark for Ruby implementations
      • Wide-range Ruby implementation benchmark
        – AFAIK, this is the first real-life benchmark to compare MRI / JRuby / Rubinius / mruby / topaz / opal
      • ProTips™ for boosting a Ruby program
        – Need to improve method calls and instance variables instead of JIT?
      • More details? → Kawasaki Ruby Kaigi 01 (2016/08/20)