The document surveys high-performance Python tools for profiling and optimising code. It profiles a Julia-set function using cProfile, line_profiler and memory_profiler, then demonstrates optimising that function with Cython, ShedSkin, Pythran, PyPy and Numba to achieve significant speedups over the pure-Python version. It argues that automated tools are valuable for high-performance Python because they reduce developer cost compared to manual optimisation.
Ian@MorConsulting.com @IanOzsvald
PyDataLondon February 2014
What is “high performance”?
● Profiling to understand system behaviour
● We often ignore this step...
● Speeding up the bottleneck
● Keeps you on 1 machine (if possible)
● Keeping team speed high
Profiling possibilities
● CPU (line by line or by function)
● Memory (line by line)
● Disk read/write (with some hacking)
● Network read/write (with some hacking)
● mmaps
● File handles
● Network connections
● Cache utilisation via libperf?
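Function-level CPU profiling is the easiest place to start; a minimal sketch using the stdlib cProfile (the function name and workload here are invented for illustration, not taken from the talk):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately naive loop so it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(200_000)
profiler.disable()

# Print the five most expensive entries, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

line_profiler (via kernprof) and memory_profiler provide the line-by-line CPU and memory views listed above.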
Cython 0.20 (pyx annotations)
#cython: boundscheck=False
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output
Pure CPython lists runtime: 12s
Cython lists runtime: 0.19s
Cython numpy runtime: 0.16s
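For reference, the 12 s pure-CPython baseline those timings compare against looks roughly like this. It is reconstructed from the Cython version above, so treat it as a sketch rather than the exact benchmark code; the Julia constant is the usual demo value, and the real benchmark runs over a full coordinate grid rather than one point:

```python
def calculate_z_purepython(maxiter, zs, cs):
    """Julia update rule in plain Python lists — the slow baseline."""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # iterate z = z*z + c until escape (|z|^2 >= 4) or maxiter is hit
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

# a point already outside |z| = 2 escapes immediately
print(calculate_z_purepython(300, [2 + 2j], [-0.62772 - 0.42193j]))  # → [0]
```

The Cython version is the same algorithm; the speedup comes almost entirely from the `cdef` type declarations removing interpreter overhead in the inner loop.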
Cython + numpy + OMP nogil
#cython: boundscheck=False
from cython.parallel import parallel, prange
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    cdef unsigned int i, length, n
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    length = len(zs)
    with nogil, parallel():
        for i in prange(length, schedule="guided"):
            z = zs[i]
            c = cs[i]
            n = 0
            while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                n = n + 1
            output[i] = n
    return output
Runtime 0.05s
ShedSkin 0.9.4 annotations
def calculate_z(maxiter, zs, cs):  # maxiter: [int], zs: [list(complex)], cs: [list(complex)]
    output = [0] * len(zs)  # [list(int)]
    for i in range(len(zs)):  # [__iter(int)]
        n = 0  # [int]
        z = zs[i]  # [complex]
        c = cs[i]  # [complex]
        while n < maxiter and (… < 4):  # [complex]
            z = z * z + c  # [complex]
            n += 1  # [int]
        output[i] = n  # [int]
    return output  # [list(int)]
Couldn't we generate Cython pyx from these inferred annotations? Runtime: 0.22s
Pythran (0.40)
#pythran export calculate_z_serial_purepython(int, complex list, complex list)
def calculate_z_serial_purepython(maxiter, zs, cs):
    …

Support for OpenMP on numpy arrays.
Author Serge made an overnight fix – superb support!
List runtime: 0.4s

#pythran export calculate_z(int, complex[], complex[], int[])
…
#omp parallel for schedule(dynamic)

OMP numpy runtime: 0.10s
PyPy nightly (and numpypy)
● “It just works” on Python 2.7 code
● Clever list strategies (e.g. unboxed, uniform)
● Little support for pre-existing C extensions (e.g. the existing numpy)
● multiprocessing, IPython etc. all work fine
● Python list code runtime: 0.3s
● (pypy)numpy support is incomplete; bugs are being tackled (numpy runtime 5s [CPython+numpy 56s])
Numba 0.12
from numba import jit

@jit(nopython=True)
def calculate_z_serial_purepython(maxiter, zs, cs, output):
    # couldn't create output, had to pass it in
    # output = numpy.zeros(len(zs), dtype=np.int32)
    for i in xrange(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        #while n < maxiter and abs(z) < 2:  # abs unrecognised
        while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
            z = z * z + c
            n += 1
        output[i] = n
    #return output
Runtime: 0.4s
● Some Python 3 support, some GPU support
● prange support missing (was in 0.11)?
● 0.12 introduces temporary limitations
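None of the slides show how zs and cs are built. A typical Julia-set setup is a dense grid of complex coordinates paired with one repeated constant; the bounds, grid width and constant below are the usual demo values, i.e. assumptions rather than values taken from the deck:

```python
def build_coords(width, x1=-1.8, x2=1.8, y1=-1.8, y2=1.8,
                 c=-0.62772 - 0.42193j):
    """Build zs (grid coordinates) and cs (repeated Julia constant).

    Hypothetical helper for illustration — not from the original talk.
    """
    xs = [x1 + i * (x2 - x1) / (width - 1) for i in range(width)]
    ys = [y1 + i * (y2 - y1) / (width - 1) for i in range(width)]
    # one complex starting point per pixel, row by row
    zs = [complex(x, y) for y in ys for x in xs]
    cs = [c] * len(zs)
    return zs, cs

zs, cs = build_coords(4)
print(len(zs))  # → 16
```

The Numba version above would then be called with a preallocated int32 output array of the same length, since (in 0.12) it could not allocate the array itself.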
Tool Tradeoffs
● PyPy: no learning curve (pure Py only) – easy win?
● ShedSkin: easy (pure Py only) but fairly rare
● Cython pure Py: hours to learn – team cost low (and lots of online help)
● Cython numpy OMP: days+ to learn – heavy team cost?
● Numba/Pythran: hours to learn; install a bit tricky (Anaconda easiest for Numba)
● Pythran OMP: very impressive result for little effort
● Numba: big toolchain which might hurt productivity?
● (numexpr not covered – great for numpy and easy to use)
Wrap up
● Our profiling options should be richer
● 4-12 physical CPU cores are now commonplace
● The cost of hand-annotating code is reduced agility
● JITs/AST compilers are getting fairly good, but manual intervention still gives the best results
BUT! CONSIDER:
● Automation should (probably) be embraced ($CPUs < $humans), as team velocity is probably higher