JIT compilation
for CPython
Dmitry Alimov
2019
SPb Python
Outline
JIT compilation and JIT history
My experience with JIT in CPython
Python projects that use JIT and projects for JIT
What is JIT compilation
JIT
Just-in-time compilation (aka dynamic translation, run-time compilation)
The earliest JIT compiler: for LISP, by John McCarthy in 1960
Ken Thompson used the technique in 1968 for regular expressions in the text editor QED
LC2
Smalltalk
Self
Popularized by Java, with James Gosling using the term from 1993
The name borrows from just-in-time manufacturing, also known as just-in-time production or the Toyota Production System (TPS)
My experience with JIT in CPython
Example
def fibonacci(n):
    """Returns n-th Fibonacci number"""
    a = 0
    b = 1
    if n < 1:
        return a
    i = 0
    while i < n:
        temp = a
        a = b
        b = temp + b
        i += 1
    return a
Fibonacci Sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
Let’s JIT it
1) Convert function to machine code at run-time
2) Execute this machine code
Let’s JIT it
@jit
def fibonacci(n):
    """Returns n-th Fibonacci number"""
    a = 0
    b = 1
    if n < 1:
        return a
    i = 0
    while i < n:
        temp = a
        a = b
        b = temp + b
        i += 1
    return a
Convert function to AST
import ast
import inspect
lines = inspect.getsource(func)
node = ast.parse(lines)
visitor = Visitor()
visitor.visit(node)
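The snippet above is a fragment (`func` and `Visitor` are defined elsewhere in the talk). A self-contained sketch of just the parsing step, parsing from a literal string because `inspect.getsource` needs file-backed source; `add_one` is a made-up example function:

```python
import ast

# Stand-in for inspect.getsource(func): a hypothetical one-line function.
source = "def add_one(x):\n    return x + 1\n"

tree = ast.parse(source)
print(ast.dump(tree))  # Module(body=[FunctionDef(name='add_one', ...)])
```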
AST
Module(body=[
FunctionDef(name='fibonacci', args=arguments(args=[Name(id='n', ctx=Param())],
vararg=None, kwarg=None, defaults=[]), body=[
Expr(value=Str(s='Returns n-th Fibonacci number')),
Assign(targets=[Name(id='a', ctx=Store())], value=Num(n=0)),
Assign(targets=[Name(id='b', ctx=Store())], value=Num(n=1)),
If(test=Compare(left=Name(id='n', ctx=Load()), ops=[Lt()], comparators=[Num(n=1)]), body=[
Return(value=Name(id='a', ctx=Load()))
], orelse=[]),
Assign(targets=[Name(id='i', ctx=Store())], value=Num(n=0)),
While(test=Compare(left=Name(id='i', ctx=Load()), ops=[Lt()], comparators=[Name(id='n', ctx=Load())]), body=[
Assign(targets=[Name(id='temp', ctx=Store())], value=Name(id='a', ctx=Load())),
Assign(targets=[Name(id='a', ctx=Store())], value=Name(id='b', ctx=Load())),
Assign(targets=[Name(id='b', ctx=Store())], value=BinOp(
left=Name(id='temp', ctx=Load()), op=Add(), right=Name(id='b', ctx=Load()))),
AugAssign(target=Name(id='i', ctx=Store()), op=Add(), value=Num(n=1))
], orelse=[]),
Return(value=Name(id='a', ctx=Load()))
], decorator_list=[Name(id='jit', ctx=Load())])
])
AST to IL ASM
class Visitor(ast.NodeVisitor):
    def __init__(self):
        self.ops = []
    ...

    def visit_Assign(self, node):
        if isinstance(node.value, ast.Num):
            self.ops.append('MOV <{}>, {}'.format(node.targets[0].id, node.value.n))
        elif isinstance(node.value, ast.Name):
            self.ops.append('MOV <{}>, <{}>'.format(node.targets[0].id, node.value.id))
        elif isinstance(node.value, ast.BinOp):
            self.ops.extend(self.visit_BinOp(node.value))
            self.ops.append('MOV <{}>, <{}>'.format(node.targets[0].id, node.value.left.id))
    ...
...
...
Assign(
targets=[Name(id='i', ctx=Store())],
value=Num(n=0)
),
Assign(
targets=[Name(id='a', ctx=Store())],
value=Name(id='b', ctx=Load())
),
...
...
MOV <i>, 0
MOV <a>, <b>
...
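A runnable reduction of the visitor above. Note one modernization: on current CPython, number literals parse as `ast.Constant` rather than the `ast.Num` shown on the slide, so this sketch matches on `ast.Constant`:

```python
import ast

class MiniVisitor(ast.NodeVisitor):
    """Emits IL ASM 'MOV' ops for simple assignments, as in the slides."""
    def __init__(self):
        self.ops = []

    def visit_Assign(self, node):
        target = node.targets[0].id
        if isinstance(node.value, ast.Constant):  # ast.Num on older CPython
            self.ops.append('MOV <{}>, {}'.format(target, node.value.value))
        elif isinstance(node.value, ast.Name):
            self.ops.append('MOV <{}>, <{}>'.format(target, node.value.id))

visitor = MiniVisitor()
visitor.visit(ast.parse("i = 0\na = b\n"))
print(visitor.ops)  # ['MOV <i>, 0', 'MOV <a>, <b>']
```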
IL ASM to ASM
MOV <a>, 0
MOV <b>, 1
CMP <n>, 1
JNL label0
RET
label0:
MOV <i>, 0
loop0:
MOV <temp>, <a>
MOV <a>, <b>
ADD <temp>, <b>
MOV <b>, <temp>
INC <i>
CMP <i>, <n>
JL loop0
RET
# for x64 system
args_registers = ['rdi', 'rsi', 'rdx', ...]
registers = ['rax', 'rbx', 'rcx', ...]
# return register: rax
def fibonacci(n):    # n ⇔ rdi
    ...
    return a         # a ⇔ rax
IL ASM to ASM
MOV rax, 0
MOV rbx, 1
CMP rdi, 1
JNL label0
RET
label0:
MOV rcx, 0
loop0:
MOV rdx, rax
MOV rax, rbx
ADD rdx, rbx
MOV rbx, rdx
INC rcx
CMP rcx, rdi
JL loop0
RET
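The register-assignment step can be sketched as a textual substitution over the IL ASM. This is a toy allocator of my own (not the talk's actual code): arguments get the SysV argument registers in order, locals get scratch registers at first use, which is enough here because the function has so few live variables:

```python
import re

def assign_registers(il_ops, arg_names):
    # Arguments get rdi, rsi, ... per the SysV ABI; locals draw from a
    # scratch pool (no spilling -- fine for this tiny function).
    arg_registers = ['rdi', 'rsi', 'rdx']
    registers = ['rax', 'rbx', 'rcx', 'rdx']
    mapping = dict(zip(arg_names, arg_registers))
    asm = []
    for op in il_ops:
        for var in re.findall(r'<(\w+)>', op):
            if var not in mapping:
                mapping[var] = registers[len(mapping) - len(arg_names)]
            op = op.replace('<{}>'.format(var), mapping[var])
        asm.append(op)
    return asm

il = ['MOV <a>, 0', 'MOV <b>, 1', 'CMP <n>, 1', 'MOV <i>, 0', 'MOV <a>, <b>']
print(assign_registers(il, ['n']))
# ['MOV rax, 0', 'MOV rbx, 1', 'CMP rdi, 1', 'MOV rcx, 0', 'MOV rax, rbx']
```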
ASM to machine code
MOV rax, 0
MOV rbx, 1
CMP rdi, 1
JNL label0
RET
label0:
MOV rcx, 0
loop0:
MOV rdx, rax
MOV rax, rbx
ADD rdx, rbx
MOV rbx, rdx
INC rcx
CMP rcx, rdi
JL loop0
RET
from pwnlib.asm import asm
code = asm(asm_code, arch='amd64')
\x48\xc7\xc0\x00\x00\x00\x00
\x48\xc7\xc3\x01\x00\x00\x00
\x48\x83\xff\x01\x7d\x01\xc3
\x48\xc7\xc1\x00\x00\x00\x00
\x48\x89\xc2\x48\x89\xd8\x48
\x01\xda\x48\x89\xd3\x48\xff
\xc1\x48\x39\xf9\x7c\xec\xc3
Create function in memory
1) Allocate memory
2) Copy machine code to allocated memory
3) Mark the memory as executable
Linux: mmap, mprotect
Windows: VirtualAlloc, VirtualProtect
Signatures in C/C++
Linux:
void *mmap(void *addr, size_t length, int prot, int flags,
int fd, off_t offset);
int mprotect(void *addr, size_t len, int prot);
void *memcpy(void *dest, const void *src, size_t n);
int munmap(void *addr, size_t length);
Windows:
LPVOID VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize,
DWORD flAllocationType, DWORD flProtect);
BOOL VirtualProtect(LPVOID lpAddress, SIZE_T dwSize,
DWORD flNewProtect, PDWORD lpflOldProtect);
void *memcpy(void *dest, const void *src, size_t count);
BOOL VirtualFree(LPVOID lpAddress, SIZE_T dwSize, DWORD dwFreeType);
Create function in memory
import ctypes
# Linux
libc = ctypes.CDLL('libc.so.6')
libc.mmap
libc.mprotect
libc.memcpy
libc.munmap
# Windows
ctypes.windll.kernel32.VirtualAlloc
ctypes.windll.kernel32.VirtualProtect
ctypes.cdll.msvcrt.memcpy
ctypes.windll.kernel32.VirtualFree
Create function in memory
mmap_func = libc.mmap
mmap_func.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_size_t]
mmap_func.restype = ctypes.c_void_p
memcpy_func = libc.memcpy
memcpy_func.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t]
memcpy_func.restype = ctypes.c_char_p
Create function in memory
machine_code = (b'\x48\xc7\xc0\x00\x00\x00\x00\x48\xc7\xc3\x01\x00\x00\x00\x48'
                b'\x83\xff\x01\x7d\x01\xc3\x48\xc7\xc1\x00\x00\x00\x00\x48\x89\xc2\x48\x89\xd8'
                b'\x48\x01\xda\x48\x89\xd3\x48\xff\xc1\x48\x39\xf9\x7c\xec\xc3')
machine_code_size = len(machine_code)
addr = mmap_func(None, machine_code_size, PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_ANONYMOUS | MAP_PRIVATE, -1, 0)
memcpy_func(addr, machine_code, machine_code_size)
func = ctypes.CFUNCTYPE(ctypes.c_uint64)(addr)
func.argtypes = [ctypes.c_uint32]
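Putting the pieces together: a minimal end-to-end sketch, assuming Linux on x86-64 (the mmap constants and byte values are platform-specific, and the bytes are hand-assembled here). It deliberately differs from the slides' listing in one respect: it keeps `b` in `r8` instead of `rbx`, because `rbx` is callee-saved in the SysV ABI and a called function must not clobber it. `ctypes.memmove` stands in for `libc.memcpy`:

```python
import ctypes

# fibonacci(n) hand-assembled for x86-64 SysV: n arrives in rdi, result in rax.
MACHINE_CODE = bytes([
    0x48, 0xc7, 0xc0, 0x00, 0x00, 0x00, 0x00,  # mov rax, 0        ; a = 0
    0x49, 0xc7, 0xc0, 0x01, 0x00, 0x00, 0x00,  # mov r8, 1         ; b = 1
    0x48, 0x83, 0xff, 0x01,                    # cmp rdi, 1
    0x7d, 0x01,                                # jnl label0
    0xc3,                                      # ret               ; return a if n < 1
    0x48, 0xc7, 0xc1, 0x00, 0x00, 0x00, 0x00,  # label0: mov rcx, 0 ; i = 0
    0x48, 0x89, 0xc2,                          # loop0:  mov rdx, rax ; temp = a
    0x4c, 0x89, 0xc0,                          # mov rax, r8       ; a = b
    0x4c, 0x01, 0xc2,                          # add rdx, r8       ; temp += b
    0x49, 0x89, 0xd0,                          # mov r8, rdx       ; b = temp
    0x48, 0xff, 0xc1,                          # inc rcx           ; i += 1
    0x48, 0x39, 0xf9,                          # cmp rcx, rdi
    0x7c, 0xec,                                # jl loop0
    0xc3,                                      # ret               ; return a
])

PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20  # values for Linux

libc = ctypes.CDLL(None)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

addr = libc.mmap(None, len(MACHINE_CODE),
                 PROT_READ | PROT_WRITE | PROT_EXEC,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
assert addr not in (None, ctypes.c_void_p(-1).value), 'mmap failed'

ctypes.memmove(addr, MACHINE_CODE, len(MACHINE_CODE))  # copy code into place

fibonacci = ctypes.CFUNCTYPE(ctypes.c_uint64, ctypes.c_uint64)(addr)
print(fibonacci(10))  # 55
```

In a real JIT you would also `munmap` the region once the function is no longer needed, and map it W^X (write, then flip to execute) rather than RWX.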
Benchmarks
for _ in range(1000000):
    fibonacci(n)

Python 2.7:
n     No JIT (s)   JIT (s)
0     0.153        0.882
10    1.001        0.878
20    1.805        0.942
30    2.658        0.955
60    4.800        0.928
90    7.117        0.922
500   50.611       1.251
Python 3.7:
n     No JIT (s)   JIT (s)
0     0.150        1.079
10    1.093        0.971
20    2.206        1.135
30    3.313        1.204
60    6.815        1.198
90    10.458       1.270
500   63.949       1.652
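The no-JIT side of these timings can be reproduced with a harness along these lines (a sketch; absolute numbers depend on hardware and interpreter version, and `number` is lowered here for a quick run):

```python
import timeit

def fibonacci(n):
    """Returns n-th Fibonacci number"""
    a = 0
    b = 1
    if n < 1:
        return a
    i = 0
    while i < n:
        temp = a
        a = b
        b = temp + b
        i += 1
    return a

# The tables use 1,000,000 calls; 10,000 keeps this sketch fast.
elapsed = timeit.timeit(lambda: fibonacci(90), number=10_000)
print('{:.3f} s for 10,000 calls'.format(elapsed))
```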
Python 2.7 vs 3.7
fibonacci(n=93) *

Python 3.7:
  No JIT: 10.524 s
  JIT: 1.185 s
  JIT ~8.5 times faster
  JIT compilation time: ~0.08 s

Python 2.7:
  No JIT: 7.942 s
  JIT: 0.887 s
  JIT ~8.5 times faster
  JIT compilation time: ~0.07 s

* fibonacci(92) = 0x68a3dd8e61eccfbd, fibonacci(93) = 0xa94fad42221f2702 — fibonacci(93) is the largest Fibonacci number that fits in an unsigned 64-bit register, so the JIT version cannot go higher without overflow
0 LOAD_CONST 1 (0)
3 STORE_FAST 1 (a)
6 LOAD_CONST 2 (1)
9 STORE_FAST 2 (b)
12 LOAD_FAST 0 (n)
15 LOAD_CONST 2 (1)
18 COMPARE_OP 0 (<)
21 POP_JUMP_IF_FALSE 28
24 LOAD_FAST 1 (a)
27 RETURN_VALUE
>> 28 LOAD_CONST 1 (0)
31 STORE_FAST 3 (i)
34 SETUP_LOOP 48 (to 85)
>> 37 LOAD_FAST 3 (i)
40 LOAD_FAST 0 (n)
43 COMPARE_OP 0 (<)
46 POP_JUMP_IF_FALSE 84
49 LOAD_FAST 1 (a)
52 STORE_FAST 4 (temp)
55 LOAD_FAST 2 (b)
58 STORE_FAST 1 (a)
61 LOAD_FAST 4 (temp)
64 LOAD_FAST 2 (b)
67 BINARY_ADD
68 STORE_FAST 2 (b)
71 LOAD_FAST 3 (i)
74 LOAD_CONST 2 (1)
77 INPLACE_ADD
78 STORE_FAST 3 (i)
81 JUMP_ABSOLUTE 37
>> 84 POP_BLOCK
>> 85 LOAD_FAST 1 (a)
88 RETURN_VALUE
MOV rax, 0
MOV rbx, 1
CMP rdi, 1
JNL label0
RET
label0:
MOV rcx, 0
loop0:
MOV rdx, rax
MOV rax, rbx
ADD rdx, rbx
MOV rbx, rdx
INC rcx
CMP rcx, rdi
JL loop0
RET
VS
33 (VM opcodes)
vs
14 (real machine instructions)
No JIT vs JIT
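The opcode side of the comparison comes straight from the `dis` module; the exact count (33 on the Python 2.7 listing above) varies across CPython versions:

```python
import dis

def fibonacci(n):
    """Returns n-th Fibonacci number"""
    a = 0
    b = 1
    if n < 1:
        return a
    i = 0
    while i < n:
        temp = a
        a = b
        b = temp + b
        i += 1
    return a

instructions = list(dis.get_instructions(fibonacci))
print(len(instructions))  # instruction count differs by CPython version
dis.dis(fibonacci)        # full listing, as shown above
```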
Projects
Numba makes Python code fast
Numba is an open source JIT compiler that translates a subset of Python and
NumPy code into fast machine code
- Parallelization
- SIMD Vectorization
- GPU Acceleration
Numba
from numba import jit
import numpy as np
@jit(nopython=True)  # Set "nopython" mode for best performance, equivalent to @njit
def go_fast(a):  # Function is compiled to machine code when called the first time
    trace = 0
    for i in range(a.shape[0]):  # Numba likes loops
        trace += np.tanh(a[i, i])  # Numba likes NumPy functions
    return a + trace  # Numba likes NumPy broadcasting
from numba import cuda

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B"""
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
LLVM — compiler infrastructure project
Tutorial “Building a JIT: Starting out with KaleidoscopeJIT”
LLVMPy — Python bindings for LLVM
LLVMLite project by Numba team — lightweight LLVM Python binding for writing
JIT compilers
LLVM
x86-64 assembler embedded in Python
Portable Efficient Assembly Code-generator in Higher-level Python
PeachPy
from peachpy.x86_64 import *
ADD(eax, 5).encode()
# bytearray(b'\x83\xc0\x05')
MOVAPS(xmm0, xmm1).encode_options()
# [bytearray(b'\x0f(\xc1'), bytearray(b'\x0f)\xc8')]
VPSLLVD(ymm0, ymm1, [rsi + 8]).encode_length_options()
# {6: bytearray(b'\xc4\xe2uGF\x08'),
#  7: bytearray(b'\xc4\xe2uGD&\x08'),
#  9: bytearray(b'\xc4\xe2uG\x86\x08\x00\x00\x00')}
PyPy
PyPy is a fast, compliant alternative implementation of the Python language
Python programs often run faster on PyPy thanks to its Just-in-Time compiler
PyPy works best when executing long-running programs where a significant
fraction of the time is spent executing Python code
“If you want your code to run faster, you should probably just use PyPy”
— Guido van Rossum (creator of Python)
Other projects
Pyjion — A JIT for Python based upon CoreCLR
Pyston — built using LLVM and modern JIT techniques
Psyco — extension module which can greatly speed up the execution of code
The first just-in-time compiler for Python, now unmaintained and dead
Unladen Swallow — an attempt to make LLVM a JIT compiler for CPython
References
1. https://en.wikipedia.org/wiki/Just-in-time_compilation
2. John Aycock: A Brief History of Just-In-Time. ACM Computing Surveys (CSUR), volume 35, issue 2, pages 97-113, June 2003, DOI: 10.1145/857076.857077
3. https://eli.thegreenplace.net/2013/11/05/how-to-jit-an-introduction
4. https://medium.com/starschema-blog/jit-fast-supercharge-tensor-processing-in-python-with-jit-compilation-47598de6ee96
5. https://github.com/Gallopsled/pwntools
6. https://numba.pydata.org
7. https://llvm.org/docs/tutorial/BuildingAJIT1.html
8. https://llvmlite.readthedocs.io/en/latest/
9. http://www.llvmpy.org
10. https://github.com/Maratyszcza/PeachPy
11. https://github.com/microsoft/Pyjion
12. https://blog.pyston.org
Thank you
