
PyPy's approach to construct domain-specific language runtime

DSL + PyPy

Published in: Software


  1. 1. Tag: virtual machine, compiler, performance PyPy’s Approach to Construct Domain-specific Language Runtime
  2. 2. Tag: virtual machine, compiler, performance Construct Domain-specific Language Runtime using
  3. 3. Speed 7.4 times faster than CPython http://speed.pypy.org antocuni (PyCon Otto) PyPy Status Update April 07 2017 4 / 19
  4. 4. Why is Python slow? • Interpretation overhead • Boxed arithmetic and automatic overflow handling • Dynamic dispatch of operations • Dynamic lookup of methods and attributes • Everything can change at runtime • Extreme introspective and reflective capabilities Francisco Fernandez Castano (@fcofdezc) PyPy November 8, 2014
  5. 5. Why is Python slow? Boxed arithmetic and automatic overflow handling i = 0 while i < 10000000: i = i + 1
  6. 6. Why is Python slow? Dynamic dispatch of operations # while i < 10000000 9 LOAD_FAST 0 (i) 12 LOAD_CONST 2 (10000000) 15 COMPARE_OP 0 (<) 18 POP_JUMP_IF_FALSE 34 # i = i + 1 21 LOAD_FAST 0 (i) 24 LOAD_CONST 3 (1) 27 BINARY_ADD 28 STORE_FAST 0 (i) 31 JUMP_ABSOLUTE 9
  7. 7. Why is Python slow? Dynamic lookup of methods and attributes class MyExample(object): pass def foo(target, flag): if flag: target.x = 42 obj = MyExample() foo(obj, True) print obj.x #=> 42 print getattr(obj, "x") #=> 42
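The dispatch overhead on the slide above can be inspected directly with the standard `dis` module. The following is a minimal sketch written for Python 3 (the slides use Python 2), so the exact opcode names and offsets will differ from the listing above and between interpreter versions; `count_up` is a hypothetical helper name.

```python
import dis

def count_up(limit):
    i = 0
    while i < limit:
        i = i + 1
    return i

# Each name below is one bytecode instruction the interpreter has to
# decode and dispatch on, for every single loop iteration.
opnames = [ins.opname for ins in dis.get_instructions(count_up)]
print(opnames)
```

Even this three-line loop expands into a load, a compare, a conditional jump, an add and a store per iteration, which is the overhead a JIT tries to remove.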
  8. 8. Why is Python slow? Everything can change at runtime def fn(): return 42 def hello(): return 'Hi! PyConEs!' def change_the_world(): global fn fn = hello print fn() #=> 42 change_the_world() print fn() #=> 'Hi! PyConEs!'
  9. 9. Why is Python slow? Everything can change at runtime class Dog(object): def __init__(self): self.name = 'Jandemor' def talk(self): print "%s: guau!" % self.name class Cat(object): def __init__(self): self.name = 'CatInstance' def talk(self): print "%s: miau!" % self.name
  10. 10. Why is Python slow? Everything can change at runtime my_pet = Dog() my_pet.talk() #=> 'Jandemor: guau!' my_pet.__class__ = Cat my_pet.talk() #=> 'Jandemor: miau!'
  11. 11. Why is Python slow? Extreme introspective and reflective capabilities import sys def fill_list(name): frame = sys._getframe().f_back lst = frame.f_locals[name] lst.append(42) def foo(): things = [] fill_list('things') print things #=> [42]
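The frame-introspection example above can be tried as a self-contained script. This is a sketch in Python 3 syntax; it relies on `sys._getframe`, which is a CPython implementation detail, and `foo` returns the list instead of printing it so the result can be checked.

```python
import sys

def fill_list(name):
    # Look up the caller's frame and mutate one of its locals by name.
    frame = sys._getframe().f_back
    lst = frame.f_locals[name]
    lst.append(42)

def foo():
    things = []
    fill_list("things")
    return things

print(foo())  # [42]
```

Because any callee may reach into its caller like this, an optimizer cannot assume that a local variable is private to the function that defines it.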
  13. 13. PyPy Translation Toolchain • Capable of compiling (R)Python • Garbage collection • Tracing just-in-time compiler generator • Software transactional memory?
  14. 14. PyPy Architecture
  15. 15. PyPy based interpreters • Topaz (Ruby) • HippyVM (PHP) • Pyrolog (Prolog) • pycket (Racket) • Various other interpreters (Scheme, JavaScript, Io, Game Boy)
  16. 16. Compiler / Interpreter Source: Compiler Construction, Prof. O. Nierstrasz
  17. 17. Traditional 2 pass compiler • intermediate representation (IR) • front end maps legal code into IR • back end maps IR onto target machine • simplify retargeting • allows multiple front ends • multiple passes → better code
  18. 18. Traditional 3 pass compiler • analyzes and changes IR • goal is to reduce runtime • must preserve values
  19. 19. Optimizer: middle end • constant propagation and folding • code motion • reduction of operator strength • common sub-expression elimination • redundant store elimination • dead code elimination Modern optimizers are usually built as a set of passes
  20. 20. Optimization Challenges • Preserve language semantics • Reflection, Introspection, Eval • External APIs • Interpreter consists of short sequences of code • Prevent global optimizations • Typically implemented as a stack machine • Dynamic, imprecise type information • Variables can change type • Duck Typing: method works with any object that provides accessed interfaces • Monkey Patching: add members to “class” after initialization • Memory management and concurrency • Function calls through packing of operands in fat object
  21. 21. PyPy Functional Architecture
  22. 22. RPython • Python subset • Statically typed • Garbage collected • Standard library almost entirely unavailable • Some missing builtins (print, open(), …) • rpython.rlib • exceptions are (sometimes) ignored • Not really a language, rather a "state"
  23. 23. PyPy Interpreter def f(x): return x + 1 >>> dis.dis(f) 2 0 LOAD_FAST 0 (x) 3 LOAD_CONST 1 (1) 6 BINARY_ADD 7 RETURN_VALUE • written in RPython • Stack-based bytecode interpreter (like JVM) • bytecode compiler → generates bytecode • bytecode evaluator → interprets bytecode • object space → handles operations on objects
  24. 24. PyPy Bytecode Interpreter
  25. 25. 31
  26. 26. CFG (Control Flow Graph) • Consists of Blocks and Links • Starting from entry_point • “Static Single Information” form def f(n): return 3 * n + 2 Block(v1): # input argument v2 = mul(Constant(3), v1) v3 = add(v2, Constant(2))
  27. 27. CFG: Static Single Information def test(a): if a > 0: if a > 5: return 10 return 4 if a < -10: return 3 return 10 • SSI: “PHIs” for all used variables • Blocks as “functions without branches”
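The "Blocks and Links" structure above can be illustrated with a toy flow graph for `f(n) = 3 * n + 2`. This is only a sketch: the `Block`, `Link`, `OPS` and `run` definitions below are simplified stand-ins written for this example, not the actual RPython flow-space classes.

```python
class Block:
    def __init__(self, inputargs):
        self.inputargs = inputargs   # names of the variables entering the block
        self.operations = []         # list of (result, opname, args)
        self.exits = []              # outgoing Links

class Link:
    def __init__(self, args, target):
        self.args = args             # values handed over to the target block
        self.target = target         # next Block, or None meaning "return"

OPS = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b}

def run(block, values):
    env = dict(zip(block.inputargs, values))
    while True:
        def lookup(arg):
            # Strings are variable names, everything else is a constant.
            return env[arg] if isinstance(arg, str) else arg
        for result, opname, args in block.operations:
            env[result] = OPS[opname](*[lookup(a) for a in args])
        link = block.exits[0]        # this sketch only follows a single exit
        outgoing = [lookup(a) for a in link.args]
        if link.target is None:
            return outgoing[0]
        block = link.target
        env = dict(zip(block.inputargs, outgoing))

# def f(n): return 3 * n + 2
start = Block(["v1"])
start.operations.append(("v2", "mul", [3, "v1"]))
start.operations.append(("v3", "add", ["v2", 2]))
start.exits.append(Link(["v3"], None))
print(run(start, [5]))  # 17
```

Each block receives all the variables it uses through its input arguments, which is what makes a block behave like a "function without branches".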
  28. 28. PyPy Advantages • High Level Language Implementation • to implement new features: lazily computed objects and functions, pluggable garbage collection, runtime replacement of live objects, stackless concurrency • JIT Generation • Object space • Stackless • infinite recursion • Microthreads: Coroutines, Tasklets and Channels, Greenlets
  29. 29. PERCEPTION http://abstrusegoose.com/secretarchives/under-the-hood - CC BY-NC 3.0 US
  30. 30. Assumptions • Pareto Principle (80-20 rule): 20% of the program accounts for 80% of the runtime (hot-spots) • Fast Path principle: optimize only what is necessary, fall back for uncommon cases • Most of runtime spent in loops • Always the same code paths (likely) antocuni (Intel@Bucharest) PyPy Intro April 4 2016
  38. 38. Tracing JIT phases • Phases: Interpretation → Tracing → Compilation → Running • Transitions: hot loop detected; entering compiled loop; cold guard failed; guard failure → hot; hot guard failed
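The phase diagram above can be approximated by a per-loop counter. The sketch below is a toy model, not PyPy's implementation: `HOT_THRESHOLD`, `LoopState` and `enter_loop_header` are hypothetical names, the threshold is deliberately tiny, and guard failures are left out.

```python
HOT_THRESHOLD = 3  # hypothetical value; real thresholds are much larger

class LoopState:
    def __init__(self):
        self.counter = 0        # how often the loop header was reached
        self.compiled = False   # whether machine code exists for the loop

def enter_loop_header(state):
    """Return the phase that handles this iteration of the loop."""
    if state.compiled:
        return "running"                 # entering compiled loop
    state.counter += 1
    if state.counter < HOT_THRESHOLD:
        return "interpretation"
    # Hot loop detected: record one iteration and compile the trace.
    state.compiled = True
    return "tracing+compilation"

state = LoopState()
phases = [enter_loop_header(state) for _ in range(5)]
print(phases)
```

The point of the counter is the Pareto assumption from the earlier slide: tracing and compiling only pay off for code that runs many times.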
  39. 39. Trace trees (1) tracetree.py def foo(): a = 0 i = 0 N = 100 while i < N: if i%2 == 0: a += 1 else: a *= 2 i += 1 return a
  40. 40. Trace trees (2) label(start, i0, a0) v0 = int_lt(i0, 2000) guard_true(v0) v1 = int_mod(i0, 2) v2 = int_eq(v1, 0) guard_true(v2) a1 = int_add(a0, 10) i1 = int_add(i0, 1) jump(start, i1, a1)
  43. 43. Trace trees (2) label(start, i0, a0) v0 = int_lt(i0, 2000) guard_true(v0) v1 = int_mod(i0, 2) v2 = int_eq(v1, 0) guard_true(v2) a1 = int_add(a0, 10) i1 = int_add(i0, 1) jump(start, i1, a1) BLACKHOLE COLD FAIL
  44. 44. Trace trees (2) label(start, i0, a0) v0 = int_lt(i0, 2000) guard_true(v0) v1 = int_mod(i0, 2) v2 = int_eq(v1, 0) guard_true(v2) a1 = int_add(a0, 10) i1 = int_add(i0, 1) jump(start, i1, a1) BLACKHOLE COLD FAIL INTERPRETER
  48. 48. Trace trees (2) label(start, i0, a0) v0 = int_lt(i0, 2000) guard_true(v0) v1 = int_mod(i0, 2) v2 = int_eq(v1, 0) guard_true(v2) a1 = int_add(a0, 10) i1 = int_add(i0, 1) jump(start, i1, a1) HOT FAIL: a1 = int_mul(a0, 2) i1 = int_add(i0, 1) jump(start, i1, a1)
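The cold-fail and hot-fail behaviour of the trace tree above can be simulated in plain Python. This is a toy model of the `tracetree.py` loop, assuming a hypothetical `BRIDGE_THRESHOLD` and treating "fall back to the interpreter" and "run the bridge" as the same arithmetic; only the bookkeeping differs.

```python
BRIDGE_THRESHOLD = 2  # hypothetical: failures before a guard counts as hot

def run_loop(n):
    a, i = 0, 0
    guard_failures = 0
    bridge_compiled = False
    events = []
    while True:                        # label(start, i, a)
        if not i < n:                  # guard on int_lt fails: leave the loop
            return a, events
        if i % 2 == 0:                 # guard on int_eq(int_mod(i, 2), 0) holds
            a = a + 1                  # main trace
        elif bridge_compiled:
            a = a * 2                  # bridge attached to the failing guard
        else:
            guard_failures += 1
            events.append("cold fail at i=%d" % i)
            a = a * 2                  # this iteration runs in the interpreter
            if guard_failures >= BRIDGE_THRESHOLD:
                bridge_compiled = True
                events.append("hot fail: bridge compiled")
        i = i + 1                      # jump(start, i, a)

result, events = run_loop(6)
print(result, events)
```

After the bridge exists, odd iterations no longer leave compiled code, so the two branches together form a small tree of traces rooted at the loop label.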
  51. 51. Part 3 The PyPy JIT
  52. 52. Terminology (1) • translation time: when you run "rpython targetpypy.py" to get the pypy binary • runtime: everything which happens after you start pypy (interpretation, tracing, compiling) • assembler/machine code: the output of the JIT compiler • execution time: when your Python program is being executed, by the interpreter or by the machine code
  53. 53. Terminology (2) • interp-level: things written in RPython • [PyPy] interpreter: the RPython program which executes the final Python programs • bytecode: "the output of dis.dis". It is executed by the PyPy interpreter. • app-level: things written in Python, and executed by the PyPy interpreter
  54. 54. Terminology (3) (the following is not 100% accurate but it’s enough to understand the general principle) • low level op or ResOperation: low-level instructions like "add two integers", "read a field out of a struct", "call this function"; (more or less) the same level as C ("portable assembler"); knows about GC objects (e.g. you have getfield_gc vs getfield_raw) • jitcodes: low-level representation of RPython functions; sequence of low level ops; generated at translation time; 1 RPython function --> 1 C function --> 1 jitcode
  55. 55. Terminology (4) • JIT traces or loops: a very specific sequence of llops as actually executed by your Python program; generated at runtime (more specifically, during tracing) • JIT optimizer: takes JIT traces and emits JIT traces • JIT backend: takes JIT traces and emits machine code
  63. 63. General architecture • compile-time: RPYTHON (def LOAD_GLOBAL(self): ... def STORE_FAST(self): ... def BINARY_ADD(self): ...) → CODEWRITER → JITCODE • JITCODE for LOAD_GLOBAL: ... p0 = getfield_gc(p0, 'func_globals') p2 = getfield_gc(p1, 'strval') call(dict_lookup, p0, p2) .... • JITCODE for STORE_FAST: ... p0 = getfield_gc(p0, 'locals_w') setarrayitem_gc(p0, i0, p1) .... • JITCODE for BINARY_ADD: ... promote_class(p0) i0 = getfield_gc(p0, 'intval') promote_class(p1) i1 = getfield_gc(p1, 'intval') i2 = int_add(i0, i1) if (overflowed) goto ... p2 = new_with_vtable('W_IntObject') setfield_gc(p2, i2, 'intval') .... • runtime: META-TRACER → OPTIMIZER → BACKEND → ASSEMBLER
  69. 69. PyPy trace example def fn(): c = a+b ... • Bytecode: LOAD_GLOBAL A, LOAD_GLOBAL B, BINARY_ADD, STORE_FAST C • Trace for LOAD_GLOBAL A: ... p0 = getfield_gc(p0, 'func_globals') p2 = getfield_gc(p1, 'strval') call(dict_lookup, p0, p2) ... • Trace for LOAD_GLOBAL B: ... p0 = getfield_gc(p0, 'func_globals') p2 = getfield_gc(p1, 'strval') call(dict_lookup, p0, p2) ... • Trace for BINARY_ADD: ... guard_class(p0, W_IntObject) i0 = getfield_gc(p0, 'intval') guard_class(p1, W_IntObject) i1 = getfield_gc(p1, 'intval') i2 = int_add(i0, i1) guard_not_overflow() p2 = new_with_vtable('W_IntObject') setfield_gc(p2, i2, 'intval') ... • Trace for STORE_FAST C: ... p0 = getfield_gc(p0, 'locals_w') setarrayitem_gc(p0, i0, p1) ....
  70. 70. PyPy optimizer • intbounds • constant folding / pure operations • virtuals • string optimizations • heap (multiple get/setfield, etc) • unroll
  71. 71. Intbound optimization (1) intbound.py def fn(): i = 0 while i < 5000: i += 2 return i
  72. 72. Intbound optimization (2) • unoptimized: ... i17 = int_lt(i15, 5000) guard_true(i17) i19 = int_add_ovf(i15, 2) guard_no_overflow() ... • optimized: ... i17 = int_lt(i15, 5000) guard_true(i17) i19 = int_add(i15, 2) ... • It works often • array bound checking • intbound info propagates all over the trace
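The intbound rewrite above can be sketched as a single pass over a trace. This is an illustration only: traces are modelled as tuples, `MAXINT` assumes 64-bit signed integers, `optimize_intbounds` is a hypothetical name, and the pass only handles `int_lt` against a constant followed by adding a non-negative constant.

```python
MAXINT = 2 ** 63 - 1   # assumption: 64-bit signed machine integers

def optimize_intbounds(trace):
    """Replace int_add_ovf with int_add when a guard proves no overflow."""
    compares = {}   # result of an int_lt -> (variable, constant)
    upper = {}      # variable -> largest value it can hold after a guard
    out = []
    drop_next_overflow_guard = False
    for op in trace:
        name = op[0]
        if name == "guard_no_overflow" and drop_next_overflow_guard:
            drop_next_overflow_guard = False
            continue                        # the guard is redundant, drop it
        if name == "int_lt":
            compares[op[1]] = (op[2], op[3])
        elif name == "guard_true" and op[1] in compares:
            var, const = compares[op[1]]
            upper[var] = const - 1          # var < const holds from here on
        elif name == "int_add_ovf":
            _, res, var, const = op
            if var in upper and const >= 0 and upper[var] + const <= MAXINT:
                op = ("int_add", res, var, const)
                drop_next_overflow_guard = True
        out.append(op)
    return out

trace = [
    ("int_lt", "i17", "i15", 5000),
    ("guard_true", "i17"),
    ("int_add_ovf", "i19", "i15", 2),
    ("guard_no_overflow",),
]
optimized = optimize_intbounds(trace)
print(optimized)
```

The guard on `i17` is what carries the bound: once it has passed, `i15` is known to be below 5000, so adding 2 cannot overflow and both the checked add and its guard can go.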
  75. 75. Virtuals (1) virtuals.py def fn(): i = 0 while i < 5000: i += 2 return i
  76. 76. Virtuals (2) • unoptimized: ... guard_class(p0, W_IntObject) i1 = getfield_pure(p0, 'intval') i2 = int_add(i1, 2) p3 = new(W_IntObject) setfield_gc(p3, i2, 'intval') ... • optimized: ... i2 = int_add(i1, 2) ... • The most important optimization (TM) • It works both inside the trace and across the loop • It works for tons of cases, e.g. function frames
  79. 79. Constant folding (1) constfold.py def fn(): i = 0 while i < 5000: i += 2 return i
  80. 80. Constant folding (2) • unoptimized: ... i1 = getfield_pure(p0, 'intval') i2 = getfield_pure(<W_Int(2)>, 'intval') i3 = int_add(i1, i2) ... • optimized: ... i1 = getfield_pure(p0, 'intval') i3 = int_add(i1, 2) ... • It "finishes the job" • Works well together with other optimizations (e.g. virtuals) • It also does "normal, boring, static" constant-folding
  83. 83. Out of line guards (1) outoflineguards.py N = 2 def fn(): i = 0 while i < 5000: i += N return i
  84. 84. Out of line guards (2) • unoptimized: ... quasiimmut_field(<Cell>, 'val') guard_not_invalidated() p0 = getfield_gc(<Cell>, 'val') ... i2 = getfield_pure(p0, 'intval') i3 = int_add(i1, i2) • optimized: ... guard_not_invalidated() ... i3 = int_add(i1, 2) ... • Python is too dynamic, but we don’t care :-) • No overhead in assembler code • Used a bit "everywhere"
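The idea behind out-of-line guards is that the rare write pays for invalidation, so the hot loop does not have to re-check the global. The sketch below models that with plain objects; `CompiledLoop`, `QuasiImmutableCell` and `run` are hypothetical names, and a real JIT patches machine code rather than flipping a flag.

```python
class CompiledLoop:
    def __init__(self):
        self.valid = True          # cleared from outside the loop

class QuasiImmutableCell:
    """Holds a value that is expected to change rarely, if ever."""
    def __init__(self, value):
        self._value = value
        self._dependents = []      # loops compiled assuming the value is constant

    def read_as_constant(self, loop):
        self._dependents.append(loop)
        return self._value

    def write(self, value):
        self._value = value
        for loop in self._dependents:
            loop.valid = False     # invalidate everything built on the old value
        self._dependents = []

def run(loop, step, limit):
    if not loop.valid:
        raise RuntimeError("loop was invalidated, fall back to the interpreter")
    i = 0
    while i < limit:
        i += step                  # the constant is baked in, no read of N here
    return i

N = QuasiImmutableCell(2)
loop = CompiledLoop()
step = N.read_as_constant(loop)    # the optimizer treats N as the constant 2
first = run(loop, step, 5000)
N.write(3)                         # someone rebinds the global
print(first, loop.valid)
```

As long as nobody writes to `N`, the loop body contains no check at all, which is what "no overhead in assembler code" refers to.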
  87. 87. Hello RPython # hello_rpython.py import os def entry_point(argv): os.write(2, "Hello, RPython!\n") return 0 def target(driver, argv): return entry_point, None
  88. 88. $ rpython hello_rpython.py … $ ./hello_rpython-c Hello, RPython!
  89. 89. Goal • BASIC interpreter capable of running Hamurabi • Bytecode based • Garbage Collection • Just-In-Time Compilation
  90. 90. Live play session
  91. 91. Architecture: Source → Parser → AST → Compiler → Bytecode → Virtual Machine
  92. 92. 10 PRINT TAB(32);"HAMURABI" 20 PRINT TAB(15);"CREATIVE COMPUTING MORRISTOWN, NEW JERSEY" 30 PRINT:PRINT:PRINT 80 PRINT "TRY YOUR HAND AT GOVERNING ANCIENT SUMERIA" 90 PRINT "FOR A TEN-YEAR TERM OF OFFICE.":PRINT 95 D1=0: P1=0 100 Z=0: P=95:S=2800: H=3000: E=H-S 110 Y=3: A=H/Y: I=5: Q=1 210 D=0 215 PRINT:PRINT:PRINT "HAMURABI: I BEG TO REPORT TO YOU,": Z=Z+1 217 PRINT "IN YEAR";Z;",";D;"PEOPLE STARVED,";I;"CAME TO THE CITY," 218 P=P+I 227 IF Q>0 THEN 230 228 P=INT(P/2) 229 PRINT "A HORRIBLE PLAGUE STRUCK! HALF THE PEOPLE DIED." 230 PRINT "POPULATION IS NOW";P 232 PRINT "THE CITY NOW OWNS ";A;"ACRES." 235 PRINT "YOU HARVESTED";Y;"BUSHELS PER ACRE." 250 PRINT "THE RATS ATE";E;"BUSHELS." 260 PRINT "YOU NOW HAVE ";S;"BUSHELS IN STORE.": PRINT 270 REM *** MORE CODE THAT DID NOT FIT INTO THE SLIDE FOLLOWS
  93. 93. Parser: Source → Parser → Abstract Syntax Tree (AST)
  94. 94. Parser: Source → Lexer → Tokens → Parser → AST
  95. 95. RPLY • Based on PLY, which is based on Lex and Yacc • Lexer generator • LALR parser generator
  96. 96. Lexer from rply import LexerGenerator lg = LexerGenerator() lg.add("NUMBER", "[0-9]+") # … lg.ignore(" +") # whitespace lexer = lg.build().lex
  97. 97. lg.add('NUMBER', r'[0-9]*\.?[0-9]+') lg.add('PRINT', r'PRINT') lg.add('IF', r'IF') lg.add('THEN', r'THEN') lg.add('GOSUB', r'GOSUB') lg.add('GOTO', r'GOTO') lg.add('INPUT', r'INPUT') lg.add('REM', r'REM') lg.add('RETURN', r'RETURN') lg.add('END', r'END') lg.add('FOR', r'FOR') lg.add('TO', r'TO') lg.add('NEXT', r'NEXT') lg.add('NAME', r'[A-Z][A-Z0-9$]*') lg.add('(', r'\(') lg.add(')', r'\)') lg.add(';', r';') lg.add('STRING', r'"[^"]*"') lg.add(':', r'\r?\n') lg.add(':', r':') lg.add('=', r'=') lg.add('<>', r'<>') lg.add('-', r'-') lg.add('/', r'/') lg.add('+', r'\+') lg.add('>=', r'>=') lg.add('>', r'>') lg.add('***', r'\*\*\*.*') lg.add('*', r'\*') lg.add('<=', r'<=') lg.add('<', r'<')
  98. 98. >>> from basic.lexer import lex >>> source = open("hello.bas").read() >>> for token in lex(source): ... print token Token("NUMBER", "10") Token("PRINT", "PRINT") Token("STRING", '"HELLO BASIC!"') Token(":", "\n")
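How an ordered list of token rules turns source text into tokens can be sketched with the standard `re` module. This is not RPLY and does not claim to reproduce its internals: it uses a hypothetical subset of the rules above, a "first matching rule wins" strategy, and plain tuples instead of `Token` objects.

```python
import re

# A subset of the rules from the slide, in the same order.
RULES = [
    ("NUMBER", r"[0-9]*\.?[0-9]+"),
    ("PRINT", r"PRINT"),
    ("NAME", r"[A-Z][A-Z0-9$]*"),
    ("STRING", r'"[^"]*"'),
    (":", r"\r?\n"),      # a line break ends a statement, like ":"
    (":", r":"),
]
IGNORE = re.compile(r" +")
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]

def lex(source):
    pos = 0
    while pos < len(source):
        skipped = IGNORE.match(source, pos)
        if skipped:
            pos = skipped.end()
            continue
        for name, regex in COMPILED:
            match = regex.match(source, pos)   # first matching rule wins
            if match:
                yield (name, match.group())
                pos = match.end()
                break
        else:
            raise ValueError("unexpected character at position %d" % pos)

tokens = list(lex('10 PRINT "HELLO BASIC!"\n'))
print(tokens)
```

Rule order matters in this scheme: keywords such as `PRINT` must come before the general `NAME` rule, and `>=` before `>`, or the shorter rule would claim the input first.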
  99. 99. Grammar • A set of formal rules that defines the syntax • terminals = tokens • nonterminals = rules defining a sequence of one or more (non)terminals
  101. 101. program : (empty) • program : line • program : line program
  102. 102. line : NUMBER statements
  103. 103. statements : statement • statements : statement statements
  104. 104. statement : PRINT : • statement : PRINT expressions : • expressions : expression • expressions : expression ; • expressions : expression ; expressions
  105. 105. statement : NAME = expression :
  106. 106. statement : IF expression THEN NUMBER :
  107. 107. statement : INPUT NAME :
  108. 108. statement : GOTO NUMBER : • statement : GOSUB NUMBER : • statement : RETURN :
  109. 109. statement : REM *** :
  110. 110. statement : FOR NAME = NUMBER TO NUMBER : • statement : NEXT NAME :
  111. 111. statement : END :
  112. 112. expression : NUMBER • expression : NAME • expression : STRING • expression : operation • expression : ( expression ) • expression : NAME ( expression )
  113. 113. operation : expression + expression • operation : expression - expression • operation : expression * expression • operation : expression / expression • operation : expression <= expression • operation : expression < expression • operation : expression = expression • operation : expression <> expression • operation : expression > expression • operation : expression >= expression
  114. 114. AST from rply.token import BaseBox class Program(BaseBox): def __init__(self, lines): self.lines = lines
  115. 115. class Line(BaseBox): def __init__(self, lineno, statements): self.lineno = lineno self.statements = statements
  116. 116. class Statements(BaseBox): def __init__(self, statements): self.statements = statements
  117. 117. class Print(BaseBox): def __init__(self, expressions, newline=True): self.expressions = expressions self.newline = newline
  118. 118.
  119. 119. Parser from rply import ParserGenerator pg = ParserGenerator(["NUMBER", "PRINT", …])
  120. 120. @pg.production("program : ") @pg.production("program : line") @pg.production("program : line program") def program(p): if len(p) == 2: return Program([p[0]] + p[1].get_lines()) return Program(p)
  121. 121. @pg.production("line : NUMBER statements") def line(p): return Line(p[0], p[1].get_statements())
  122. 122. @pg.production("op : expression + expression") @pg.production("op : expression * expression") def op(p): if p[1].gettokentype() == "+": return Add(p[0], p[2]) elif p[1].gettokentype() == "*": return Mul(p[0], p[2])
  123. 123. pg = ParserGenerator([…], precedence=[ ("left", ["+", "-"]), ("left", ["*", "/"]) ])
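The effect of the precedence table above ("*" and "/" bind tighter than "+" and "-", all left-associative) can be shown without a parser generator. The sketch below uses precedence climbing, which is a different technique from the LALR parser RPLY generates; `PRECEDENCE`, `tokenize`, `parse_expression` and `parse_atom` are names invented for this example.

```python
import re

# Higher number binds tighter; all four operators are left-associative.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def tokenize(text):
    return re.findall(r"[0-9]+|[-+*/()]", text)

def parse_expression(tokens, min_prec=1):
    left = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Left-associative: the right operand may only contain tighter operators.
        right = parse_expression(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left

def parse_atom(tokens):
    token = tokens.pop(0)
    if token == "(":
        inner = parse_expression(tokens)
        tokens.pop(0)              # drop the closing ")"
        return inner
    return int(token)

tree = parse_expression(tokenize("1 + 2 * 3 - 4"))
print(tree)  # ('-', ('+', 1, ('*', 2, 3)), 4)
```

Without a precedence declaration, the grammar rule `expression op expression` is ambiguous, and the same input could also be read as `(1 + 2) * (3 - 4)`.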
  124. 124. parse = pg.build().parse
  125. 125. Compiler/Virtual Machine: AST → Compiler → Bytecode → Virtual Machine
  131. 131. class VM(object): def __init__(self, program): self.program = program self.pc = 0 self.frames = [] self.iterators = {} self.stack = [] self.variables = {}
  132. 132. class VM(object): … def execute(self): while self.pc < len(self.program.instructions): self.execute_bytecode(self.program.instructions[self.pc])
  133. 133. class VM(object): … def execute_bytecode(self, code): raise NotImplementedError(code)
  134. 134. class VM(object): ... def execute_bytecode(self, code): if isinstance(code, TYPE): self.execute_TYPE(code) ... else: raise NotImplementedError(code)
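The dispatch pattern on the last few slides can be tried in isolation. This is a minimal sketch assuming only two instruction types and plain Python floats on the stack; MiniVM is a hypothetical name, and the real VM in the slides has more state and more instructions.

```python
class Instruction(object):
    pass


class Number(Instruction):
    def __init__(self, value):
        self.value = value


class Add(Instruction):
    pass


class MiniVM(object):
    """Hypothetical cut-down VM: a program counter, a stack, and isinstance dispatch."""

    def __init__(self, instructions):
        self.instructions = instructions
        self.pc = 0
        self.stack = []

    def execute(self):
        while self.pc < len(self.instructions):
            self.execute_bytecode(self.instructions[self.pc])

    def execute_bytecode(self, code):
        # One isinstance branch per instruction type, as on the slide.
        if isinstance(code, Number):
            self.execute_number(code)
        elif isinstance(code, Add):
            self.execute_add(code)
        else:
            raise NotImplementedError(code)

    def execute_number(self, code):
        self.stack.append(code.value)
        self.pc += 1

    def execute_add(self, code):
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append(left + right)
        self.pc += 1


vm = MiniVM([Number(2.0), Number(3.0), Add()])
vm.execute()
```

Each handler advances `pc` itself, so jump instructions can later set `pc` to an arbitrary target instead.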
  135. 135. class Program(object): def __init__(self): self.instructions = [] Bytecode
  136. 136. class Instruction(object): pass
  137. 137. class Number(Instruction): def __init__(self, value): self.value = value class String(Instruction): def __init__(self, value): self.value = value
  138. 138. class Print(Instruction): def __init__(self, expressions, newline): self.expressions = expressions self.newline = newline
  139. 139. class Call(Instruction): def __init__(self, function_name): self.function_name = function_name
  140. 140. class Let(Instruction): def __init__(self, name): self.name = name
  141. 141. class Lookup(Instruction): def __init__(self, name): self.name = name
  142. 142. class Add(Instruction): pass class Sub(Instruction): pass class Mul(Instruction): pass class Equal(Instruction): pass ...
  143. 143. class GotoIfTrue(Instruction): def __init__(self, target): self.target = target class Goto(Instruction): def __init__(self, target, with_frame=False): self.target = target self.with_frame = with_frame class Return(Instruction): pass
  144. 144. class Input(Instruction): def __init__(self, name): self.name = name
  145. 145. class For(Instruction): def __init__(self, variable): self.variable = variable class Next(Instruction): def __init__(self, variable): self.variable = variable
  146. 146. class Program(object): def __init__(self): self.instructions = [] self.lineno2instruction = {} def __enter__(self): return self def __exit__(self, exc_type, exc_value, tb): if exc_type is None: for i, instruction in enumerate(self.instructions): instruction.finalize(self, i)
  147. 147. def finalize(self, program, index): self.target = program.lineno2instruction[self.target]
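The finalize step exists because a GOTO may point at a BASIC line that has not been compiled yet. The sketch below shows that two-pass resolution end to end. It assumes a no-op finalize on the Instruction base class, which the slides do not show, and it fills lineno2instruction by hand instead of going through the AST.

```python
class Instruction(object):
    def finalize(self, program, index):
        # Assumption: most instructions have nothing to resolve.
        pass


class Goto(Instruction):
    def __init__(self, target):
        self.target = target  # a BASIC line number until finalize() runs

    def finalize(self, program, index):
        # Replace the line number with the index of its first instruction.
        self.target = program.lineno2instruction[self.target]


class Program(object):
    def __init__(self):
        self.instructions = []
        self.lineno2instruction = {}

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        if exc_type is None:
            for i, instruction in enumerate(self.instructions):
                instruction.finalize(self, i)


with Program() as program:
    # Line 10 jumps forward to line 30, which is not known at this point.
    program.lineno2instruction[10] = len(program.instructions)
    program.instructions.append(Goto(30))
    program.lineno2instruction[20] = len(program.instructions)
    program.instructions.append(Instruction())
    program.lineno2instruction[30] = len(program.instructions)
    program.instructions.append(Instruction())
```

When the with block exits, the Goto's target has been rewritten from the line number 30 to the instruction index 2.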
  148. 148. class Program(BaseBox): … def compile(self): with bytecode.Program() as program: for line in self.lines: line.compile(program) return program
  149. 149. class Line(BaseBox): ... def compile(self, program): program.lineno2instruction[self.lineno] = len(program.instructions) for statement in self.statements: statement.compile(program)
  151. 151. class Print(Statement): def compile(self, program): for expression in self.expressions: expression.compile(program) program.instructions.append( bytecode.Print( len(self.expressions), self.newline ) )
  152. 152. class Print(Statement): ... def compile(self, program): for expression in self.expressions: expression.compile(program) program.instructions.append( bytecode.Print( len(self.expressions), self.newline ) )
  153. 153. class Let(Statement): ... def compile(self, program): self.value.compile(program) program.instructions.append( bytecode.Let(self.name) )
  154. 154. class Input(Statement): ... def compile(self, program): program.instructions.append( bytecode.Input(self.variable) )
  155. 155. class Goto(Statement): ... def compile(self, program): program.instructions.append( bytecode.Goto(self.target) ) class Gosub(Statement): ... def compile(self, program): program.instructions.append( bytecode.Goto( self.target, with_frame=True ) ) class Return(Statement): ... def compile(self, program): program.instructions.append( bytecode.Return() )
  156. 156. class For(Statement): ... def compile(self, program): self.start.compile(program) program.instructions.append( bytecode.Let(self.variable) ) self.end.compile(program) program.instructions.append( bytecode.For(self.variable) )
  157. 157. class WrappedObject(object): pass class WrappedString(WrappedObject): def __init__(self, value): self.value = value class WrappedFloat(WrappedObject): def __init__(self, value): self.value = value
  158. 158. class VM(object): … def execute_number(self, code): self.stack.append(WrappedFloat(code.value)) self.pc += 1 def execute_string(self, code): self.stack.append(WrappedString(code.value)) self.pc += 1
  159. 159. class VM(object): … def execute_call(self, code): argument = self.stack.pop() if code.function_name == "TAB": self.stack.append(WrappedString(" " * int(argument.value))) elif code.function_name == "RND": self.stack.append(WrappedFloat(random.random())) ... self.pc += 1
  160. 160. class VM(object): … def execute_let(self, code): value = self.stack.pop() self.variables[code.name] = value self.pc += 1 def execute_lookup(self, code): value = self.variables[code.name] self.stack.append(value) self.pc += 1
  161. 161. class VM(object): … def execute_add(self, code): right = self.stack.pop() left = self.stack.pop() self.stack.append(WrappedFloat(left.value + right.value)) self.pc += 1
  162. 162. class VM(object): … def execute_goto_if_true(self, code): condition = self.stack.pop() if condition: self.pc = code.target else: self.pc += 1
  163. 163. class VM(object): … def execute_goto(self, code): if code.with_frame: self.frames.append(self.pc + 1) self.pc = code.target
  164. 164. class VM(object): … def execute_return(self, code): self.pc = self.frames.pop()
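GOSUB and RETURN share the Goto instruction and differ only in whether a return address is pushed onto frames. The sketch below replays that logic in a single function. Mark is a hypothetical instruction added only to record the order of execution, and run is a hypothetical stand-in for the VM loop.

```python
class Goto(object):
    def __init__(self, target, with_frame=False):
        self.target = target
        self.with_frame = with_frame


class Return(object):
    pass


class Mark(object):
    """Hypothetical instruction that records a label, used only to observe the order."""

    def __init__(self, label):
        self.label = label


def run(instructions):
    pc, frames, trace = 0, [], []
    while pc < len(instructions):
        code = instructions[pc]
        if isinstance(code, Goto):
            if code.with_frame:
                frames.append(pc + 1)  # GOSUB: remember where to come back to
            pc = code.target
        elif isinstance(code, Return):
            pc = frames.pop()  # RETURN: resume after the matching GOSUB
        else:
            trace.append(code.label)
            pc += 1
    return trace


trace = run([
    Goto(3, with_frame=True),  # 0: GOSUB to the subroutine at index 3
    Mark("after"),             # 1: runs once the subroutine returns
    Goto(5),                   # 2: jump past the end to stop
    Mark("sub"),               # 3: subroutine body
    Return(),                  # 4
])
```

The subroutine body runs first, then execution resumes at the instruction following the GOSUB.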
  165. 165. class VM(object): … def execute_input(self, code): value = WrappedFloat(float(raw_input() or "0.0")) self.variables[code.name] = value self.pc += 1
  166. 166. class VM(object): … def execute_for(self, code): self.pc += 1 self.iterators[code.variable] = ( self.pc, self.stack.pop().value )
  167. 167. class VM(object): … def execute_next(self, code): loop_begin, end = self.iterators[code.variable] current_value = self.variables[code.variable].value next_value = current_value + 1.0 if next_value <= end: self.variables[code.variable] = WrappedFloat(next_value) self.pc = loop_begin else: del self.iterators[code.variable] self.pc += 1
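The interplay of execute_for and execute_next is easier to follow when replayed for one loop. The sketch below assumes plain floats instead of wrapped objects and a single loop variable named I; run_for_loop is a hypothetical helper, not part of the slides.

```python
def run_for_loop(start, end):
    """Replay FOR I = start TO end ... NEXT I following the slides' logic."""
    variables = {"I": float(start)}  # effect of the Let emitted before For
    iterators = {}
    visited = []

    pc = 0  # index of the For instruction
    # execute_for: remember where the body starts and the loop bound.
    pc += 1
    iterators["I"] = (pc, float(end))

    while True:
        # Loop body: record the current value of the loop variable.
        visited.append(variables["I"])
        # execute_next: step by 1.0 and jump back while within the bound.
        loop_begin, bound = iterators["I"]
        next_value = variables["I"] + 1.0
        if next_value <= bound:
            variables["I"] = next_value
            pc = loop_begin
        else:
            del iterators["I"]
            break
    return visited


ascending = run_for_loop(1, 3)
empty_range = run_for_loop(5, 2)
```

Because the bound is only checked in NEXT, the body runs once even when the start value already exceeds the end value, as `empty_range` shows.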
  168. 168. def entry_point(argv): try: filename = argv[1] except IndexError: print("You must supply a filename") return 1 content = read_file(filename) tokens = lex(content) ast = parse(tokens) program = ast.compile() vm = VM(program) vm.execute() return 0 Entry Point
  169. 169. JIT (in PyPy) 1. Identify "hot" loops 2. Create trace inserting guards based on observed values 3. Optimize trace 4. Compile trace 5. Execute machine code instead of interpreter
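Step 1 above can be pictured as a counter per loop header. The sketch below is a toy model of that idea only, not PyPy's implementation: the threshold value, the function name find_hot_loops, and the input format (a list of backward-jump targets in the order they were taken) are all assumptions.

```python
HOT_THRESHOLD = 3  # hypothetical value, chosen only to keep the example small


def find_hot_loops(jump_targets, threshold=HOT_THRESHOLD):
    """Return the jump targets that were reached `threshold` times, in detection order."""
    counters = {}
    hot = []
    for target in jump_targets:
        counters[target] = counters.get(target, 0) + 1
        if counters[target] == threshold:
            # A real tracing JIT would start recording a trace at this point.
            hot.append(target)
    return hot


hot_loops = find_hot_loops([9, 9, 20, 9, 9])
```

Only the loop at target 9 crosses the threshold; the jump to 20 is taken once and stays in the interpreter.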
  170. 170. from rpython.rlib.jit import JitDriver jitdriver = JitDriver( greens=["pc", "vm", "program", "frames", "iterators"], reds=["stack", "variables"] )
  171. 171. class VM(object): … def execute(self): while self.pc < len(self.program.instructions): jitdriver.merge_point( vm=self, pc=self.pc, … )
  172. 172. Benchmark 10 N = 1 20 IF N <= 10000 THEN 40 30 END 40 GOSUB 100 50 IF R = 0 THEN 70 60 PRINT "PRIME"; N 70 N = N + 1: GOTO 20 100 REM *** ISPRIME N -> R 110 IF N <= 2 THEN 170 120 FOR I = 2 TO (N - 1) 130 A = N: B = I: GOSUB 200 140 IF R <> 0 THEN 160 150 R = 0: RETURN 160 NEXT I 170 R = 1: RETURN 200 REM *** MOD A -> B -> R 210 R = A - (B * INT(A / B)) 220 RETURN
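For readers less used to line-numbered BASIC, the benchmark above is a trial-division prime search. The following is a hand translation into Python, offered as a reading aid and not as the code that produced the timings; the function names are hypothetical.

```python
def mod(a, b):
    # Lines 200-220: R = A - (B * INT(A / B))
    return a - (b * int(a / b))


def is_prime(n):
    # Lines 100-170. Like the BASIC source, this reports 1 and 2 as prime.
    if n <= 2:
        return True
    for i in range(2, n):  # FOR I = 2 TO (N - 1)
        if mod(n, i) == 0:
            return False
    return True


def primes_up_to(limit):
    # Lines 10-70: test every N from 1 to limit and keep the hits.
    return [n for n in range(1, limit + 1) if is_prime(n)]


small = primes_up_to(30)
```

The benchmark itself uses a limit of 10000; the call above uses 30 to keep the example quick.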
  173. 173. cbmbasic: 58.22s; basic-c: 5.06s; basic-c-jit: 2.34s; Python implementation (CPython): 2.83s; Python implementation (PyPy): 0.11s; C implementation: 0.03s
  174. 174. Project milestones 2008: Django support; 2010: First JIT-compiler; 2011: Compatibility with CPython 2.7; 2014: Basic ARM support; CPython 3 support; Improve compatibility with C extensions; NumPyPy; Multi-threading support
  175. 175. PyPy STM
  176. 176. PyPy STM http://dabeaz.com/GIL/gilvis/ GIL locking
  177. 177. PyPy STM 10 loops, best of 3: 1.2 sec per loop 10 loops, best of 3: 822 msec per loop from threading import Thread def count(n): while n > 0: n -= 1 def run(): t1 = Thread(target=count, args=(10000000,)) t1.start() t2 = Thread(target=count, args=(10000000,)) t2.start() t1.join(); t2.join() def count(n): while n > 0: n -= 1 def run(): count(10000000) count(10000000) Inside the Python GIL - David Beazley
  178. 178. PyPy in the real world (1) High frequency trading platform for sports bets: low latency is a must. PyPy used in production since 2012. ~100 PyPy processes running 24/7. Up to 10x speedups after careful tuning and optimizing for PyPy. antocuni (PyCon Otto) PyPy Status Update April 07 2017 6 / 19
  179. 179. PyPy in the real world (2) Real-time online advertising auctions: tight latency requirement (<100ms), high throughput (hundreds of thousands of requests per second). 30% speedup. "We run PyPy basically everywhere" (Julian Berman)
  180. 180. PyPy in the real world (3) IoT on the cloud. 5-10x faster. "We do not even run benchmarks on CPython because we just know that PyPy is way faster" (Tobias Oberstein)
