
2009 Eclipse Con



Preliminary slide deck



  1. A Lock-Free Hash Table
     Fast Bytecodes for Funny Languages
     Dr. Cliff Click, Chief JVM Architect & Distinguished Engineer, Azul Systems
     March 25, 2009
  2. Fast Bytecodes For Funny Languages
     • JVMs are used for lots of non-Javac bytecodes
     • Bytecode patterns are different
     • Bytecode producers (this crowd) are curious:
       - Are these bytecodes "fast"? "Fast enough"?
       - Am I inlining?
       - JIT'ing to good code? (JIT'ing at all?)
     • Personal goal: you ARE using that fancy JIT, right?
       - Otherwise, why did I bother? ;-)
     • Another goal: maybe help the world escape the Java box
       - A new language must have a reputation for speed
       - Or a large programmer population won't look
  8. Can Your Language Go "To The Metal"?
     • Can your bytecodes go "to the metal"?
       - e.g. can simple code be mapped to simple machine ops?
     • Methodology:
       - Write a dumb, hot, SIMPLE loop in lots of languages
       - Look at performance
       - Look at the JIT'd code
       - Look for mismatches between language, JVM & machine code
     • NOT a goal: who is fastest
     • NOT tested:
       - Ease of programming, time-to-market, maintenance cost
       - Domain fit (e.g. Ruby-on-Rails)
     • Ignoring:
       - Blatant language / microbenchmark mismatch (FixNum/BigNum)
 14. Can You Take THIS "To The Metal"?
     • The benchmark loop: for( int i=1; i<100000000; i++ ) sum += (sum^i)/i;
     • Run with Java, Scala, Clojure, JRuby, JPC, JavaScript/Rhino
       - Willing to try more languages
     • NOT a goal: discover the fastest
     • NOT fair: I changed the benchmark to be more "fair" to various languages
       - e.g. using 'double' instead of 'int' for JavaScript
     • Tested on the Azul JVM
       - Because it has the best low-level JVM perf-analysis tools I've seen
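For reference, the loop from the slide as a self-contained Java program (the talk's baseline language). The limit is parameterized only so the kernel is easy to test at a small bound; everything else follows the slide verbatim.

```java
public class HotLoop {
    // The benchmark kernel from the slide:
    //   for( int i=1; i<100000000; i++ ) sum += (sum^i)/i;
    // With plain ints, the JIT can map each iteration onto roughly the
    // add/xor/div/add sequence shown on the Scorecard slide.
    static int kernel(int limit) {
        int sum = 0;
        for (int i = 1; i < limit; i++) {
            sum += (sum ^ i) / i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        int sum = kernel(100000000);
        long ms = (System.nanoTime() - t0) / 1000000;
        System.out.println("sum=" + sum + " in " + ms + " ms");
    }
}
```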
 17. Scorecard
     • Java (unrolled, showing 1 iteration):
         add rTmp0, rI, 5       // for unrolled iter #5
         xor rTmp1, rSum, rTmp0
         div rTmp2, rTmp1, rTmp0
         add rSum, rTmp2, rSum
         ... repeat for unrolled iterations
     • Scala – same as Java!
       - The "elegant Scala-ized version" – ugh
     • Clojure – almost close; very allocation-heavy; oddball 'holes'
     • JRuby – major inlining not happening
       - Misled by a debugging flag?
     • JPC – fascinating; totally inlined X86 emulation
       - But the JIT doesn't grok e.g. dead X86 flag setting
     • JavaScript/Rhino – death by 'Double' (not 'double')
 20. Deeper Dive
     • Java – unfair advantage: semantics match JIT expectations
     • Scala – borrowed heavily from Java semantics
       - The close fit allows close translation
       - And thus nearly direct translation to machine code
       - Penalty: only Java semantics, e.g. no 'unsigned' type, no auto-BigNum inflation
     • JPC – Java "PC"; a pure-Java X86/DOS emulator
       - Massive bytecode generation; LOTS of JIT'ing
       - For this example: 16000 classes, 7800 compilations
       - But the JIT falls down removing e.g. redundant X86 flag sets
       - Maybe fix by never storing 'flags' into memory
       - Also no fixnum issue
 28. Deeper Dive
     • Clojure – "almost close"
       - Good: no obvious subroutine calls in the inner loop
       - Bad: massive "ephemeral" object allocation; requires good GC,
         but needs Escape Analysis to go fast
       - Ugly: fixnum overflow checks everywhere
         (can turn off fixnums; could then be the same speed as Java)
       - Weird "holes": un-optimized reflection calls here & there
     • Jython – "almost close"
       - Also has the fixnum issue; massive allocation
       - Some extra locks thrown in
     • JavaScript/Rhino
       - All things are Doubles – not 'double'
       - Same allocation issues as fixnums
       - Otherwise looks pretty good (no fixnum checks)
 37. Deeper Dive
     • Common issue – fixnums
       - Allocation costs (but GC does not); final fields cost an mfence
       - Could do much better with the JIT:
         needs only ultra-stupid Escape Analysis plus some (easy) JIT tweaking,
         e.g. getting around Integer.valueOf(i) caching
     • JRuby – missed the metal
       - Assuming CachingCallSite::call inlines (and allows further inlining)
       - Using +PrintInlining to determine
       - But the flag is lying: it claims inlined, but it's not
         (the issue is BimorphicInlining 'guessing' the wrong target)
       - Confirmed with a debug Java5 & with GDB on product-mode Java6
       - Confirmed on Azul – my 1st impression: can't be inlined ever
         (but please, Charles, tell me why you think it should!)
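The Integer.valueOf caching mentioned above is easy to see from plain Java: autoboxing routes through Integer.valueOf, which returns cached objects for values in -128..127 and allocates a fresh Integer outside that range, so a fixnum-style language that boxes every loop index allocates on nearly every iteration. A minimal demonstration, assuming default VM settings (the cache's upper bound can be raised with -XX:AutoBoxCacheMax):

```java
public class BoxingCache {
    public static void main(String[] args) {
        Integer a = Integer.valueOf(100);   // inside the -128..127 cache
        Integer b = Integer.valueOf(100);
        Integer c = Integer.valueOf(1000);  // outside the cache: fresh allocation
        Integer d = Integer.valueOf(1000);
        System.out.println(a == b); // true  - same cached object
        System.out.println(c == d); // false - two distinct boxes
    }
}
```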
 45. What Happened to JRuby?
     • Calls are stitched together with a trampoline
 46. What Happened to JRuby?
     • Goal: inline 'targ' into the caller
     • The JIT removes the trampoline
     • Possibly inlines
 49. What Happened to JRuby?
     • Code not statically analyzable
     • No inlining during profiling
     • Profiling confuses the callers seen per call site
 52. What Happened to JRuby?
     • Code not statically analyzable
     • No inlining during profiling
     • Profiling confuses the callers seen per call site
     • C2 inlines 'call'
       - No dominant target
       - No attempt at guarded inlining
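The dispatch pattern at issue can be sketched in plain Java (the names here are illustrative, not JRuby's actual classes): the call site caches the last-seen receiver class and its method, guarding a fast path with a class check. For the Ruby-level body to "go to the metal", the JIT must inline call() and then, through the guard, the cached target; if it never attempts the guarded inline, or guesses the wrong target as described above, every Ruby-level call stays an out-of-line dispatch.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a caching (one-entry "inline cache") call site.
// DynMethod and CachingCallSite are made-up names for this example.
interface DynMethod {
    int call(Object self, int arg);
}

final class CachingCallSite {
    private final Map<Class<?>, DynMethod> methodTable; // per-class lookup
    private Class<?> cachedClass;    // guard: receiver class seen last time
    private DynMethod cachedMethod;  // target cached for that class

    CachingCallSite(Map<Class<?>, DynMethod> methodTable) {
        this.methodTable = methodTable;
    }

    int call(Object self, int arg) {
        // Fast path: one class-equality guard, then the cached target.
        // This is the path the JIT must see through for inlining to pay off.
        if (self.getClass() == cachedClass) {
            return cachedMethod.call(self, arg);
        }
        return slowPath(self, arg);
    }

    private int slowPath(Object self, int arg) {
        // Full lookup, then refill the one-entry cache.
        DynMethod m = methodTable.get(self.getClass());
        cachedClass = self.getClass();
        cachedMethod = m;
        return m.call(self, arg);
    }
}
```

A call such as `site.call("recv", 41)` takes the slow path once, then every later call with a String receiver hits the guarded fast path.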
 57. Lessons
     • You CAN take a "funny language" to the metal (e.g. Scala)
     • Easy to be misled (JRuby)
       - Reliance on a non-QA'd debugging flag, +PrintInlining
       - The rules on inlining are complex and subtle
     • But so much performance depends on it!
       - And exact details matter: class hierarchy, final, interface, single-target
     • And how do you know about JRockit (BEA), IBM, etc.?
     • And heuristics change anyway:
       - What works today is dog-slow tomorrow
       - Unless your language hits the standard benchmark list
     • Language/bytecode mismatch is made worse by assuming too much,
       - e.g. an Uber GC, Uber Escape Analysis, or subtle final-field semantics
 62. Some Suggestions
     • How do you know when your bytecodes are working well?
       - Lack of good tool support for telling why the optimizer bailed out
       - The same lack as for C++ & Fortran over the past 50 yrs
     • 1 – Run a microbenchmark with hot simple code & GDB it
       - Why? So you can 'read' the asm
       - Break in with GDB and – read the asm
       - Did translation go as you expected? (No?)
       - Simplify even more...
     • 2 – Run your fav JVM with debugging flags
       - But debugging flags can lie
       - Must cross-correlate with (1)
     • 3 – Run on Azul & use RTPM
       - Only sorta kidding: email me & ask for an Academic Account
 68. Round 2: Alternative Concurrency
     • Only looked at Clojure – got lazy busy
     • Clojure Traveling Salesman Problem with worker 'ants'
     • Tried up to 600 'ants' on a 768-way Azul
       - Good scaling (but 20 Gig/sec allocation)
     • Tried a contention microbenchmark
       - Performance died
       - Less TOTAL throughput as more CPUs were added
       - JDK 6 library failure? Clojure usage-model failure?
         Either way, not a graceful fall-back under pressure
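The contention result (less total throughput as CPUs are added) is the classic single-hot-cache-line pathology: every update fights over one word. A standard mitigation, independent of Clojure, is to stripe the contended state so each thread mostly touches its own slot; this is the idea later packaged as java.util.concurrent.atomic.LongAdder in JDK 8. A minimal sketch:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// A minimal striped counter: spread updates across several slots hashed
// by thread id, so adding CPUs adds capacity instead of piling all the
// contention onto a single cache line. sum() folds the slots together.
public class StripedCounter {
    private final AtomicLongArray cells;

    public StripedCounter(int stripes) {
        cells = new AtomicLongArray(stripes);
    }

    public void increment() {
        int slot = (int) (Thread.currentThread().getId() % cells.length());
        cells.incrementAndGet(slot);
    }

    public long sum() {
        long total = 0;
        for (int i = 0; i < cells.length(); i++) {
            total += cells.get(i);
        }
        return total;
    }
}
```

The trade-off is that sum() is only a moving snapshot under concurrent updates, which is exactly the relaxation that buys back scaling.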
 73. Lessons?
     • Too little data (e.g. no Scala); just my speculation
     • Clojure-style MIGHT work, might not
       - A 600-way thread-pool works well on Java also
     • NOT graceful under pressure
       - Adding more CPUs should not be bad
       - A throughput cap is expected, or maybe slow degradation – but rapid falloff!?!?
       - Maybe STM becomes "graceful under pressure" later
         (but it's been 15 yrs and it's still not good)
     • Note: complex locking schemes suffer the same way
       - e.g. add more concurrent DB requests: DB throughput goes up...
       - ...then down, then craters
       - Why does the DB not queue requests & maintain max throughput?
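The queueing fix suggested above can be sketched with standard java.util.concurrent pieces: cap concurrency at the measured saturation point and let a bounded queue absorb the excess, instead of letting every extra client pile onto the hot resource. The class name and the capacities below are made-up placeholders for illustration.

```java
import java.util.concurrent.*;

// Sketch: admission control in front of a saturable resource.
// maxConcurrent is the (assumed, measured elsewhere) saturation point;
// requests beyond it wait in a bounded queue, so throughput stays pinned
// near the max instead of cratering as load grows. If even the queue
// fills, CallerRunsPolicy pushes work back onto the caller, which
// throttles the request sources themselves.
public class QueuedServer {
    private final ExecutorService workers;

    public QueuedServer(int maxConcurrent, int queueCapacity) {
        workers = new ThreadPoolExecutor(
                maxConcurrent, maxConcurrent,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),    // bounded backlog
                new ThreadPoolExecutor.CallerRunsPolicy()); // overflow: caller pays
    }

    public <T> Future<T> submit(Callable<T> request) {
        return workers.submit(request);
    }

    public void shutdown() {
        workers.shutdown();
    }
}
```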
 80. Future Big Problem
     • Reliable performance "under pressure"
       - e.g. adding more threads to a hot lock drops throughput some...
         ...then throughput stabilizes as more threads are added
       - e.g. adding more CPUs to a wait/notifyAll peaks throughput...
         ...then it falls off as most threads uselessly wake up & sleep
     • And we're working with weak & immature concurrency libs
     • Everybody has a max throughput
       - And Fall-off Under Load is Bad (FULB™)
     • Naively: just queue requests beyond the saturation point
       - And maintain max throughput
     • Then why not just publish that (now reliable) max throughput?
       - So you can predict performance as the min of max-throughputs...