Sista: Improving Cog’s JIT performance

Speaker: Clément Béra
Thu, August 21, 9:45am – 10:30am

Video Part1
https://www.youtube.com/watch?v=X4E_FoLysJg
Video Part2
https://www.youtube.com/watch?v=gZOk3qojoVE

Description
Abstract: Although recent improvements to the Cog VM have made it one of the fastest available Smalltalk virtual machines, the overhead compared to optimized C code remains significant. Efficient industrial object-oriented virtual machines, such as Google Chrome's V8 JavaScript engine and Oracle's Java HotSpot, can reach the performance of optimized C code on many benchmarks thanks to adaptive optimizations performed by their JIT compilers. The VM becomes cleverer: after executing the same portion of code many times, it interrupts execution, examines what it is doing, and recompiles the critical portions into faster code based on the current environment and previous executions.

Bio: Clément Béra and Eliot Miranda have been working together on Cog's JIT performance for the last year. Clément Béra is a young engineer who has been working in the Pharo team for the past two years. Eliot Miranda is a Smalltalk VM expert who, among other things, implemented Cog's JIT and the Spur Memory Manager for Cog.


Transcript

  • 1. Sista: Improving Cog’s JIT performance Clément Béra
  • 2. Main people involved in Sista • Eliot Miranda • Over 30 years of experience in Smalltalk VMs • Clément Béra • Engineer in the Pharo team for 2 years • PhD student starting in October
  • 3. Cog ? • Smalltalk virtual machine • Runs Pharo, Squeak, Newspeak, ...
  • 4. Plan • Efficient VM architecture (Java, C#, ...) • Sista goals • Example: optimizing a method • Our approach • Status
  • 5. Virtual machine High level code (Java, Smalltalk ...) Runtime environment (JRE, JDK, Object engine ...) CPU: Intel, ARM ... OS: Windows, Android, Linux ...
  • 6. Basic Virtual machine High level code (Java, Smalltalk ...) Compiler Bytecode VM Bytecode Interpreter Memory Manager CPU RAM
  • 7. Fast virtual machine High level code (Java, Smalltalk ...) Compiler Bytecode VM Bytecode Interpreter JIT Compiler Memory Manager CPU RAM
  • 8. Efficient JIT compiler
    display: listOfDrinks
        listOfDrinks do: [ :drink | self displayOnScreen: drink ]
    Execution number | Time to run (ms) | Comments
    1        | 1     | lookup of #displayOnScreen (cache) and bytecode interpretation
    2 to 6   | 0.5   | bytecode interpretation
    7        | 2     | generation of native code for displayOnScreen and native code run
    8 to 999 | 0.01  | native code run
    1000     | 2     | adaptive recompilation based on runtime type information, generation of native code and optimized native code run
    1000+    | 0.002 | optimized native code run
  • 9. In Cog
    display: listOfDrinks
        listOfDrinks do: [ :drink | self displayOnScreen: drink ]
    Execution number | Time to run (ms) | Comments
    1        | 1     | lookup of #displayOnScreen (cache) and bytecode interpretation
    2 to 6   | 0.5   | bytecode interpretation
    7        | 2     | generation of native code for display and native code run
    8 to 999 | 0.01  | native code run
    1000     | 2     | adaptive recompilation based on runtime type information, generation of native code and optimized native code run
    1000+    | 0.002 | optimized native code run
  • 10. A look into WebKit • WebKit JavaScript engine • LLInt = Low Level Interpreter • DFG JIT = Data Flow Graph JIT
  • 11. WebKit JS benchmarks
  • 12. In Cog
  • 13. What is Sista ? • Implementing the next JIT optimization level
  • 14. Goals • Smalltalk performance • Code readability
  • 15. Smalltalk performance The Computer Language Benchmarks Game
  • 16. Smalltalk performance The Computer Language Benchmarks Game
  • 17. Smalltalk performance The Computer Language Benchmarks Game Smalltalk is 4x~25x slower than Java
  • 18. Smalltalk performance The Computer Language Benchmarks Game Smalltalk is 4x~25x slower than Java Our goal is 3x faster: 1.6x~8x slower than Java + Smalltalk features
  • 19. Code Readability • Messages optimized by the bytecode compiler overused in the kernel • #do: => #to:do:
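    A minimal illustration of the readability cost this slide refers to, assuming a hypothetical process: method; the second form is faster today only because the bytecode compiler inlines #to:do:, which is why kernel code tends to be written that way:

        "idiomatic Smalltalk: sends #do: and evaluates a block for each element"
        aCollection do: [ :each | self process: each ].

        "hand-optimized form often found in the kernel: #to:do: is inlined
         by the bytecode compiler, so there is no #do: send and no block
         activation, at the price of readability"
        1 to: aCollection size do: [ :index |
            self process: (aCollection at: index) ].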
  • 20. Adaptive recompilation • Recompiles on-the-fly portion of code frequently used based on the current environment and previous executions
  • 21. Optimizing a method
  • 22. Example
  • 23. Example • Executing #display: with over 30 000 different drinks ...
  • 24. Hot spot detector • Detects methods frequently used
  • 25. Hot spot detector • JIT adds counters on machine code • Counters are incremented when code execution reaches it • When a counter reaches threshold, the optimizer is triggered
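    A minimal sketch of the counter logic these bullets describe, written as Smalltalk for readability; the real counters are emitted by the JIT as machine code, and counter, CounterThreshold and triggerOptimizerFor: are illustrative names:

        "executed each time the instrumented point (e.g. a branch) is reached"
        counter := counter + 1.
        counter >= CounterThreshold ifTrue: [
            "hand control to the in-image optimizer (see the VM callback slides)"
            self triggerOptimizerFor: thisContext ].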
  • 26. Example Cannot detect hot spot Can detect hot spot (#to:do: is compiled inline)
  • 27. Hot spot detected Over 30 000 executions
  • 28. Optimizer • What to optimize ?
  • 29. Optimizer • What to optimize ? • Block evaluation is costly
  • 30. Optimizer • What to optimize ? Stack frame #display: Stack frame #do: (stack grows down; hot spot detected here)
  • 31. Optimizer • What to optimize ? Stack frame #display: Stack frame #do: Hot spot detected here
  • 32. Find the best method
  • 33. Find the best method • To be able to optimize the block activation, we need to optimize both #example and #do:
  • 34. Inlining • Replaces a function call by its callee
  • 35. Inlining • Replaces a function call by its callee • Speculative: what if listOfDrinks is not an Array anymore ?
  • 36. Inlining • Replaces a function call by its callee • Speculative: what if listOfDrinks is not an Array anymore ? • Type-feedback from inline caches
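    A tiny sketch of what replacing a call by its callee means, using a made-up double: method and assuming the inline caches only ever observed one receiver class:

        "the callee"
        double: anInteger
            ^ anInteger + anInteger

        "before inlining: the caller performs a full message send"
        result := self double: x.

        "after speculative inlining: the callee's body is substituted for the
         send, valid only as long as the recorded type feedback still holds"
        result := x + x.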
  • 37. Guard • Guard checks: listOfDrinks hasClass: Array • If a guard fails, the execution stack is dynamically deoptimized • If a guard fails more than a threshold, the optimized method is uninstalled
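    A sketch of such a guard in front of code specialized for Array, written as Smalltalk for readability (the actual guard is a single class check in the optimized code, and deoptimizeAndResume is an illustrative name):

        listOfDrinks class == Array ifFalse: [
            "guard failure: rebuild the unoptimized stack frames and
             resume execution in the original, unoptimized methods"
            ^ self deoptimizeAndResume ].
        "fast path specialized for Array continues here"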
  • 38. Dynamic deoptimization • On Stack replacement • PCs, variables and methods are mapped. Stack frame #display: Stack frame #do: Stack frame #optimizedDisplay:
  • 39. Optimizations • 1) #do: is inlined into #display:
  • 40. Optimizations • 2) Block evaluation is inlined
  • 41. Optimizations • 3) at: is optimized • at: optimized ???
  • 42. At: implementation • VM implementation • Is index an integer ? • Is object a variable-sized object, a byte object, ... ? (Array, ByteArray, ... ?) • Is index within the bound of the object ? • Answers the value
  • 43. Optimizations • 3) at: is optimized • index isSmallInteger • 1 <= index <= listOfDrinks size • listOfDrinks class == Array • uncheckedAt1: (at: for variable size objects with in-bounds integer argument)
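    Putting slides 39 to 43 together, the optimized method could look roughly like the sketch below; the optimizer actually produces bytecode rather than source, optimizedDisplay: and deoptimize are illustrative names, and uncheckedAt1: is the unchecked access named on the slide:

        optimizedDisplay: listOfDrinks
            "guard derived from inline-cache type feedback"
            listOfDrinks class == Array ifFalse: [ ^ self deoptimize ].
            "#do: and the block are inlined; #to:do: is itself inlined by the
             bytecode compiler, so no block activation remains"
            1 to: listOfDrinks size do: [ :index |
                "class and bounds are already checked, so unchecked access is safe"
                self displayOnScreen: (listOfDrinks uncheckedAt1: index) ]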
  • 44. Restarting • On Stack replacement Stack frame #display: Stack frame #do: Stack frame #optimizedDisplay:
  • 45. Our approach
  • 46. In-image Optimizer • Inputs • Stack with hot compiled method • Branch and type information • Actions • Generates and installs an optimized compiled method • Generates deoptimization metadata • Edit the stack to use the optimized compiled method
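    A rough sketch of the top-level flow this slide describes; every class and selector name below is invented for illustration:

        SistaOptimizer >> optimizeHotSpotIn: aContext
            | hotMethod profile optimized |
            hotMethod := aContext method.
            "type and branch information collected by the JIT, obtained
             through the introspection primitive"
            profile := self sendAndBranchDataFor: hotMethod.
            "produce an optimized CompiledMethod plus its deoptimization metadata"
            optimized := self compile: hotMethod withProfile: profile.
            self install: optimized inPlaceOf: hotMethod.
            "edit the running stack so execution continues in the optimized method"
            self replaceOnStack: aContext with: optimized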
  • 47. In-image Deoptimizer • Inputs • Stack to deoptimize (failing guard, debugger, ...) • Deoptimization metadata • Actions • Deoptimize the stack • May discard the optimized method
  • 48. Advantages • Optimizer / Deoptimizer in Smalltalk • Easier to debug and program in Smalltalk than in Slang / C • Editable at runtime • Lower engineering cost
  • 49. Advantages On Stack replacement done with Context manipulation Stack frame #display: Stack frame #do: Stack frame #optimizedDisplay:
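    The point of this slide is that contexts are ordinary objects in Smalltalk, so the in-image deoptimizer can rebuild stack frames directly; a minimal sketch, assuming a context constructor along the lines of Pharo/Squeak's sender:receiver:method:arguments: and illustrative variable names:

        "rebuild an unoptimized frame to stand in for the optimized one"
        displayFrame := Context
            sender: optimizedFrame sender
            receiver: optimizedFrame receiver
            method: originalDisplayMethod
            arguments: optimizedFrame arguments.
        "the deoptimization metadata then maps the PC and temporaries of
         optimizedFrame into displayFrame (and a rebuilt #do: frame),
         which replace optimizedFrame in the sender chain"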
  • 50. Cons • What if the optimizer is triggered while optimizing ?
  • 51. Advantages • CompiledMethod to optimized CompiledMethod • Snapshot saves compiled methods
  • 52. Cons • Some optimizations are difficult • Instruction selection • Instruction scheduling • Register allocation
  • 53. Critical optimizations • Inlining (Methods and Closures) • Specialized at: and at:put: • Specialized arithmetic operations
  • 54. Interface VM - Image • Extended bytecode set • CompiledMethod introspection primitive • VM callback to trigger optimizer /deoptimizer
  • 55. Extended bytecode set • Guards • Specialized inlined primitives • at:, at:put: • +, - , <=, =, ...
  • 56. Introspection primitive • Answers • Type info for each message send • Branch info for each conditional jump • Answers nothing if method not in machine code zone
  • 57. VM callback • Selector in specialObjectsArray • Sent to the active context when hot spot / guard failure detected
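    A sketch of what the image side of this callback could look like; the selector and the optimizer class are invented for illustration, and the real selector is the one registered in the specialObjectsArray:

        Context >> hotSpotOrGuardFailureDetected
            "sent by the VM to the active context when a counter reaches its
             threshold or a guard fails; hand over to the in-image
             optimizer / deoptimizer"
            ^ SistaOptimizer handleTrapIn: self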
  • 58. Stability • Low engineering resources • No thousands of human testers
  • 59. Stability • Gemstone style ? • Large test suites
  • 60. Stability • RMOD research team • Inventing new validation techniques • Implementing state-of-the-art validators
  • 61. Stability goal • Both bench and large test suite run on CI • Anyone could contribute
  • 62. Anyone can contribute
  • 63. Anyone can contribute
  • 64. Stability goal • Both bench and large test suite run on CI • Anyone could contribute
  • 65. Status • A full round trip works • Hot spot detected in the VM • In-image optimizer finds a method to optimize, optimizes it and installs it • Method and closure inlining • Stack dynamic deoptimization
  • 66. Status: hot spot detector • Richards benchmark • Cog: 341ms • Cog + hot spot detector: 351ms • 3% overhead on Richards • Detects hot spots • Branch counts ⇒ basic block counts
  • 67. Next steps • Dynamic deoptimization of closures is not stable enough • Efficient new bytecode instructions • Stabilizing bounds check elimination • Lowering memory footprint • Invalidation of optimized methods • On stack replacement
  • 68. Thanks • Stéphane Ducasse • Marcus Denker • Yaron Kashai
  • 69. Conclusion • We hope to have a prototype running for Christmas • We hope to push it to production in around a year
  • 70. Demo ...
  • 71. Questions