Sista: Improving Cog’s JIT performance

Title: Sista: Improving Cog’s JIT performance
Speaker: Clément Béra
Thu, August 21, 9:45am – 10:30am

Video Part1
https://www.youtube.com/watch?v=X4E_FoLysJg
Video Part2
https://www.youtube.com/watch?v=gZOk3qojoVE

Description
Abstract: Although recent improvements to the Cog VM's performance have made it one of the fastest available Smalltalk virtual machines, the overhead compared to optimized C code remains significant. Efficient industrial object-oriented virtual machines, such as Google Chrome's V8 JavaScript engine and Oracle's Java HotSpot, can reach the performance of optimized C code on many benchmarks thanks to the adaptive optimizations performed by their JIT compilers. The VM then becomes cleverer: after executing the same portion of code numerous times, it stops execution, looks at what the code is doing, and recompiles the critical portions into faster code based on the current environment and previous executions.

Bio: Clément Béra and Eliot Miranda have been working together on Cog's JIT performance for the last year. Clément Béra is a young engineer who has been working in the Pharo team for the past two years. Eliot Miranda is a Smalltalk VM expert who, among other things, implemented Cog's JIT and the Spur memory manager for Cog.

Sista: Improving Cog’s JIT performance

  1. 1. Sista: Improving Cog’s JIT performance Clément Béra
  2. 2. Main people involved in Sista • Eliot Miranda • Over 30 years of experience in Smalltalk VMs • Clément Béra • Engineer in the Pharo team for 2 years • PhD student starting in October
  3. 3. Cog ? • Smalltalk virtual machine • Runs Pharo, Squeak, Newspeak, ...
  4. 4. Plan • Efficient VM architecture (Java, C#, ...) • Sista goals • Example: optimizing a method • Our approach • Status
  5. 5. Virtual machine • High level code (Java, Smalltalk, ...) • Runtime environment (JRE, JDK, object engine, ...) • CPU: Intel, ARM, ... - OS: Windows, Android, Linux, ...
  6. 6. Basic virtual machine • High level code (Java, Smalltalk, ...) compiled to bytecode by a compiler • VM: bytecode interpreter + memory manager • CPU, RAM
  7. 7. Fast virtual machine • High level code (Java, Smalltalk, ...) compiled to bytecode by a compiler • VM: bytecode interpreter + JIT compiler + memory manager • CPU, RAM
  8. 8. Efficient JIT compiler
     display: listOfDrinks
         listOfDrinks do: [ :drink | self displayOnScreen: drink ]
     Execution number | Time to run (ms) | Comments
     1 | 1 | lookup of #displayOnScreen (cache) and bytecode interpretation
     2 to 6 | 0.5 | bytecode interpretation
     7 | 2 | generation of native code for #displayOnScreen and native code run
     8 to 999 | 0.01 | native code run
     1000 | 2 | adaptive recompilation based on runtime type information, generation of native code and optimized native code run
     1000+ | 0.002 | optimized native code run
  9. 9. In Cog
     display: listOfDrinks
         listOfDrinks do: [ :drink | self displayOnScreen: drink ]
     Execution number | Time to run (ms) | Comments
     1 | 1 | lookup of #displayOnScreen (cache) and bytecode interpretation
     2 to 6 | 0.5 | bytecode interpretation
     7 | 2 | generation of native code for #display: and native code run
     8 to 999 | 0.01 | native code run
     1000 | 2 | adaptive recompilation based on runtime type information, generation of native code and optimized native code run
     1000+ | 0.002 | optimized native code run
  10. 10. A look into WebKit • WebKit JavaScript engine • LLInt = Low Level Interpreter • DFG JIT = Data Flow Graph JIT
  11. 11. WebKit JS benchmarks
  12. 12. In Cog
  13. 13. What is Sista ? • Implementing the next JIT optimization level
  14. 14. Goals • Smalltalk performance • Code readability
  15. 15. Smalltalk performance The Computer Language Benchmarks Game
  16. 16. Smalltalk performance The Computer Language Benchmarks Game
  17. 17. Smalltalk performance The Computer Language Benchmarks Game Smalltalk is 4x~25x slower than Java
  18. 18. Smalltalk performance The Computer Language Benchmarks Game Smalltalk is 4x~25x slower than Java Our goal is 3x faster: 1.6x~8x slower than Java + Smalltalk features
  19. 19. Code Readability • Messages optimized by the bytecode compiler are overused in the kernel • #do: => #to:do:
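     A minimal sketch of this readability trade-off, reusing the listOfDrinks example from the talk (the surrounding method is assumed, not quoted from the slides): #do: is a real message send with a block argument, so kernel code that needs speed often rewrites it with #to:do:, which the bytecode compiler inlines into a plain loop.

         "Readable version: #do: is a real message send with a block argument."
         listOfDrinks do: [ :drink | self displayOnScreen: drink ].

         "Kernel-style version: #to:do: is inlined by the bytecode compiler,
          so no send and no block activation happen at run time."
         1 to: listOfDrinks size do: [ :i |
             self displayOnScreen: (listOfDrinks at: i) ].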
  20. 20. Adaptive recompilation • Recompiles frequently used portions of code on the fly, based on the current environment and previous executions
  21. 21. Optimizing a method
  22. 22. Example
  23. 23. Example • Executing #display: with over 30 000 different drinks ...
  24. 24. Hot spot detector • Detects methods frequently used
  25. 25. Hot spot detector • The JIT adds counters to the machine code • Counters are incremented when code execution reaches them • When a counter reaches a threshold, the optimizer is triggered
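     Conceptually, each instrumented branch in the jitted method behaves like the pseudo-Smalltalk below; the threshold name and the trigger selector are illustrative assumptions, the real counters live in the generated machine code.

         "Executed (in machine code) every time the branch is reached."
         counter := counter + 1.
         counter >= CounterThreshold ifTrue: [
             "The VM calls back into the image so the optimizer can run."
             self triggerOptimizerFor: thisContext ].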
  26. 26. Example • Cannot detect hot spot • Can detect hot spot (#to:do: compiled inline)
  27. 27. Hot spot detected Over 30 000 executions
  28. 28. Optimizer • What to optimize ?
  29. 29. Optimizer • What to optimize ? • Block evaluation is costly
  30. 30. Optimizer • What to optimize ? • Stack frame #display: → Stack frame #do: (stack growing down) • Hot spot detected here
  31. 31. Optimizer • What to optimize ? • Stack frame #display: → Stack frame #do: • Hot spot detected here
  32. 32. Find the best method
  33. 33. Find the best method • To be able to optimize the block activation, we need to optimize both #example and #do:
  34. 34. Inlining • Replaces a call by the body of the called method
  35. 35. Inlining • Replaces a call by the body of the called method • Speculative: what if listOfDrinks is not an Array anymore ?
  36. 36. Inlining • Replaces a call by the body of the called method • Speculative: what if listOfDrinks is not an Array anymore ? • Type feedback from inline caches
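     As a rough sketch, assuming the inline caches report that listOfDrinks has always been an Array, inlining #do: and the block into #display: yields something like the hand-written method below (hypothetical code, only meant to show the shape of the result):

         optimizedDisplay: listOfDrinks
             | index |
             index := 1.
             [ index <= listOfDrinks size ] whileTrue: [
                 self displayOnScreen: (listOfDrinks at: index).
                 index := index + 1 ]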
  37. 37. Guard • Guard checks: listOfDrinks hasClass: Array • If a guard fails, the execution stack is dynamically deoptimized • If a guard fails more often than a threshold, the optimized method is uninstalled
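     A guard is essentially a class check placed in front of the speculatively inlined code; #deoptimizeCurrentFrame below is a hypothetical selector standing in for the deoptimization machinery.

         "Guard: the inlined code is only valid while listOfDrinks is an Array."
         listOfDrinks class == Array
             ifFalse: [ self deoptimizeCurrentFrame ].
         "...speculatively inlined loop for the Array case follows..."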
  38. 38. Dynamic deoptimization • On-stack replacement • PCs, variables and methods are mapped • Stack frame #display: / Stack frame #do: ↔ Stack frame #optimizedDisplay:
  39. 39. Optimizations • 1) #do: is inlined into #display:
  40. 40. Optimizations • 2) Block evaluation is inlined
  41. 41. Optimizations • 3) at: is optimized • at: optimized ???
  42. 42. At: implementation • VM implementation • Is index an integer ? • Is the object a variable-sized object, a byte object, ... ? (Array, ByteArray, ... ?) • Is index within the bounds of the object ? • Answers the value
  43. 43. Optimizations • 3) at: is optimized • index isSmallInteger • 1 <= index <= listOfDrinks size • listOfDrinks class == Array • uncheckedAt1: (at: for variable size objects with in-bounds integer argument)
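     Putting these facts together, the generic #at: send inside the loop can be replaced by the unchecked primitive named on the slide; a sketch (drink is an illustrative temporary):

         "Generic send: the VM re-checks the index and the receiver's class
          on every iteration."
         drink := listOfDrinks at: index.

         "After optimization: the guard proved listOfDrinks class == Array and
          the loop proved 1 <= index <= listOfDrinks size, so the checks go away."
         drink := listOfDrinks uncheckedAt1: index.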
  44. 44. Restarting • On-stack replacement • Stack frame #display: / Stack frame #do: → Stack frame #optimizedDisplay:
  45. 45. Our approach
  46. 46. In-image Optimizer • Inputs • Stack with hot compiled method • Branch and type information • Actions • Generates and installs an optimized compiled method • Generates deoptimization metadata • Edits the stack to use the optimized compiled method
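     A very rough sketch of the optimizer's driver, matching the inputs and actions above; every selector except Context>>method is a hypothetical name, the point is only that the whole pipeline is ordinary Smalltalk code running in the image.

         optimizeHotSpotIn: aContext
             | hotMethod optimizedMethod metadata |
             hotMethod := aContext method.
             optimizedMethod := self
                 compileOptimized: hotMethod
                 withBranchAndTypeInfo: (self runtimeInfoFor: hotMethod).
             metadata := self deoptimizationMetadataFor: optimizedMethod.
             self install: optimizedMethod inPlaceOf: hotMethod.
             self rewrite: aContext toUse: optimizedMethod keeping: metadata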
  47. 47. In-image Deoptimizer • Inputs • Stack to deoptimize (failing guard, debugger, ...) • Deoptimization metadata • Actions • Deoptimizes the stack • May discard the optimized method
  48. 48. Advantages • Optimizer / Deoptimizer in Smalltalk • Easier to debug and program in Smalltalk than in Slang / C • Editable at runtime • Lower engineering cost
  49. 49. Advantages • On-stack replacement done with Context manipulation • Stack frame #display: / Stack frame #do: → Stack frame #optimizedDisplay:
  50. 50. Cons • What if the optimizer is triggered while optimizing ?
  51. 51. Advantages • CompiledMethod to optimized CompiledMethod • Snapshot saves compiled methods
  52. 52. Cons • Some optimizations are difficult • Instruction selection • Instruction scheduling • Register allocation
  53. 53. Critical optimizations • Inlining (Methods and Closures) • Specialized at: and at:put: • Specialized arithmetic operations
  54. 54. Interface VM - Image • Extended bytecode set • CompiledMethod introspection primitive • VM callback to trigger the optimizer / deoptimizer
  55. 55. Extended bytecode set • Guards • Specialized inlined primitives • at:, at:put: • +, - , <=, =, ...
  56. 56. Introspection primitive • Answers • Type info for each message send • Branch info for each conditional jump • Answers nothing if the method is not in the machine code zone
  57. 57. VM callback • Selector in specialObjectsArray • Sent to the active context when hot spot / guard failure detected
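     A hedged sketch of the image side of that callback (the selector and the SistaOptimizer class are assumptions, not the actual Sista code): the VM simply sends the registered selector to the active context.

         "Installed on Context; invoked by the VM when a counter trips or a
          guard fails in machine code."
         hotSpotOrGuardFailureDetected
             SistaOptimizer new handleCallbackFrom: self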
  58. 58. Stability • Low engineering resources • No thousands of human testers
  59. 59. Stability • GemStone style ? • Large test suites
  60. 60. Stability • RMoD research team • Inventing new validation techniques • Implementing state-of-the-art validators
  61. 61. Stability goal • Both benchmarks and a large test suite run on the CI • Anyone could contribute
  62. 62. Anyone can contribute
  63. 63. Anyone can contribute
  64. 64. Stability goal • Both benchmarks and a large test suite run on the CI • Anyone could contribute
  65. 65. Status • A full round trip works • Hot spot detected in the VM • In-image optimizer finds a method to optimize, optimizes it and installs it • Method and closure inlining • Stack dynamic deoptimization
  66. 66. Status: hot spot detector • Richards benchmark • Cog: 341ms • Cog + hot spot detector: 351ms • 3% overhead on Richards • Detects hot spots • Branch counts ⇒ basic block counts
  67. 67. Next steps • Dynamic deoptimization of closures is not stable enough • Efficient new bytecode instructions • Stabilizing bounds check elimination • Lowering memory footprint • Invalidation of optimized methods • On-stack replacement
  68. 68. Thanks • Stéphane Ducasse • Marcus Denker • Yaron Kashai
  69. 69. Conclusion • We hope to have a prototype running for Christmas • We hope to push it to production in around a year
  70. 70. Demo ...
  71. 71. Questions
