The High Performance Python Landscape by Ian Ozsvald

1. The High Performance Python Landscape: profiling and fast calculation
   Ian Ozsvald, @IanOzsvald, MorConsulting.com
2. Ian@MorConsulting.com @IanOzsvald PyDataLondon February 2014
   What is “high performance”?
   ● Profiling to understand system behaviour
   ● We often ignore this step...
   ● Speeding up the bottleneck
   ● Keeps you on 1 machine (if possible)
   ● Keeping team speed high
3. “High Performance Python”
   • “Practical Performant Programming for Humans”
   • Please join the mailing list via IanOzsvald.com
4. cProfile
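
As a minimal sketch of function-level profiling with the standard library's `cProfile`, the example below profiles a pure-Python Julia-set update loop of the kind used throughout the talk. The grid size, the constant `c`, the `maxiter` value and the `build_inputs` helper are illustrative assumptions, not values taken from the slides.

```python
import cProfile
import io
import pstats


def calculate_z_serial_purepython(maxiter, zs, cs):
    """Calculate output list using the Julia update rule."""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while abs(z) < 2 and n < maxiter:
            z = z * z + c
            n += 1
        output[i] = n
    return output


def build_inputs(width=100, c=complex(-0.62772, -0.42193)):
    """Hypothetical helper: a small width x width grid over [-1.8, 1.8)^2."""
    step = 3.6 / width
    coords = [-1.8 + k * step for k in range(width)]
    zs = [complex(x, y) for y in coords for x in coords]
    cs = [c] * len(zs)
    return zs, cs


zs, cs = build_inputs()

# Profile only the call we care about.
profiler = cProfile.Profile()
profiler.enable()
output = calculate_z_serial_purepython(300, zs, cs)
profiler.disable()

# Report the five most expensive entries by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists whole functions (including the built-in `abs`), so it shows which function is slow but not which line inside it; that is the gap a line-by-line profiler fills.
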
5. line_profiler

       Line #      Hits         Time  Per Hit   % Time  Line Contents
       ==============================================================
            9                                           @profile
           10                                           def calculate_z_serial_purepython(
                                                                maxiter, zs, cs):
           12         1         6870   6870.0      0.0      output = [0] * len(zs)
           13   1000001       781959      0.8      0.8      for i in range(len(zs)):
           14   1000000       767224      0.8      0.8          n = 0
           15   1000000       843432      0.8      0.8          z = zs[i]
           16   1000000       786013      0.8      0.8          c = cs[i]
           17  34219980     36492596      1.1     36.2          while abs(z) < 2 and n < maxiter:
           18  33219980     32869046      1.0     32.6              z = z * z + c
           19  33219980     27371730      0.8     27.2              n += 1
           20   1000000       890837      0.9      0.9          output[i] = n
           21         1            4      4.0      0.0      return output
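
The Hits and % Time columns in the line_profiler output can be cross-checked with a little arithmetic. The sketch below only re-uses figures printed on the slide; the variable names are illustrative.

```python
# Figures copied from the line_profiler output on the slide.
points = 1_000_000          # hits on lines 14-16 and 20 (one per point)
while_tests = 34_219_980    # hits on line 17 (the while condition)
inner_updates = 33_219_980  # hits on lines 18-19 (the loop body)

# Each point evaluates the condition once more than it runs the body,
# because the final evaluation is the one that exits the loop.
assert while_tests - inner_updates == points

mean_iterations = inner_updates / points
inner_loop_share = round(36.2 + 32.6 + 27.2, 1)  # % Time for lines 17-19

print(f"mean inner iterations per point: {mean_iterations:.2f}")
print(f"share of time in lines 17-19: {inner_loop_share}%")
```

Lines 17 to 19 together account for 96% of the measured time, which is why the later slides concentrate on speeding up that inner loop.
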
6. memory_profiler

       Line #    Mem usage    Increment   Line Contents
       ================================================
            9   89.934 MiB    0.000 MiB   @profile
           10                             def calculate_z_serial_purepython(
                                                  maxiter, zs, cs):
           12   97.566 MiB    7.633 MiB       output = [0] * len(zs)
           13  130.215 MiB   32.648 MiB       for i in range(len(zs)):
           14  130.215 MiB    0.000 MiB           n = 0
           15  130.215 MiB    0.000 MiB           z = zs[i]
           16  130.215 MiB    0.000 MiB           c = cs[i]
           17  130.215 MiB    0.000 MiB           while n < maxiter and abs(z) < 2:
           18  130.215 MiB    0.000 MiB               z = z * z + c
           19  130.215 MiB    0.000 MiB               n += 1
           20  130.215 MiB    0.000 MiB           output[i] = n
           21  122.582 MiB   -7.633 MiB       return output
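
memory_profiler is a third-party package, so as a hedged, standard-library-only stand-in the sketch below uses `tracemalloc` to measure the allocation on line 12 of the listing. This is not the tool shown on the slide: `tracemalloc` tracks Python-level allocations rather than whole-process memory. The figure of 1,000,000 points is an assumption taken from the hit counts in the line_profiler run.

```python
import tracemalloc

N_POINTS = 1_000_000  # assumed to match the profiled run

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
output = [0] * N_POINTS  # the allocation from line 12 of the listing
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

grown_mib = (after - before) / 2**20
print(f"output list grew traced memory by {grown_mib:.3f} MiB")
```

On a 64-bit CPython build each list slot is an 8-byte pointer, so 1,000,000 slots come to roughly 7.63 MiB, in line with the 7.633 MiB increment reported for line 12. The larger 32.648 MiB increment on line 13 is presumably the full list that `range` builds under Python 2.
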
7. memory_profiler mprof
   https://github.com/scikit-learn/scikit-learn/pull/2248
   Before & After an improvement
8. Transforming memory_profiler into a resource profiler?
9. Profiling possibilities
   ● CPU (line by line or by function)
   ● Memory (line by line)
   ● Disk read/write (with some hacking)
   ● Network read/write (with some hacking)
   ● mmaps
   ● File handles
   ● Network connections
   ● Cache utilisation via libperf?
10. Cython 0.20 (pyx annotations)

        #cython: boundscheck=False
        def calculate_z(int maxiter, zs, cs):
            """Calculate output list using Julia update rule"""
            cdef unsigned int i, n
            cdef double complex z, c
            output = [0] * len(zs)
            for i in range(len(zs)):
                n = 0
                z = zs[i]
                c = cs[i]
                while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                    z = z * z + c
                    n += 1
                output[i] = n
            return output

    Pure CPython lists code: 12s
    Cython lists runtime: 0.19s
    Cython numpy runtime: 0.16s
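
The Cython version replaces the `abs(z) < 2` test from the pure-Python code with a comparison of the squared magnitude against 4. A minimal pure-Python sketch of why the two tests agree follows; the function names and sample values are illustrative assumptions.

```python
def in_bounds_abs(z):
    """Original test: abs() on a complex number involves a square root."""
    return abs(z) < 2


def in_bounds_squared(z):
    """Rewritten test: compare the squared magnitude with 4 (= 2 ** 2)."""
    return (z.real * z.real + z.imag * z.imag) < 4


# Sample points chosen away from floating-point rounding at the boundary,
# plus the exact boundary value 2 + 0j.
samples = [0j, 1 + 1j, 2 + 0j, 1.5 + 1.5j, -0.5 + 0.25j, 3 - 4j]
for z in samples:
    assert in_bounds_abs(z) == in_bounds_squared(z)
    print(z, in_bounds_squared(z))
```

Since both sides of `|z| < 2` are non-negative, squaring them preserves the inequality, and the squared form needs only multiplications and an addition.
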
11. Cython + numpy + OMP nogil

        #cython: boundscheck=False
        from cython.parallel import parallel, prange
        import numpy as np
        cimport numpy as np

        def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
            cdef unsigned int i, length, n
            cdef double complex z, c
            cdef int[:] output = np.empty(len(zs), dtype=np.int32)
            length = len(zs)
            with nogil, parallel():
                for i in prange(length, schedule="guided"):
                    z = zs[i]
                    c = cs[i]
                    n = 0
                    while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                        z = z * z + c
                        # plain assignment: an in-place "n += 1" inside prange
                        # would be treated as a reduction
                        n = n + 1
                    output[i] = n
            return output

    Runtime: 0.05s
12. ShedSkin 0.9.4 annotations

        def calculate_z(maxiter, zs, cs):        # maxiter: [int], zs: [list(complex)], cs: [list(complex)]
            output = [0] * len(zs)               # [list(int)]
            for i in range(len(zs)):             # [__iter(int)]
                n = 0                            # [int]
                z = zs[i]                        # [complex]
                c = cs[i]                        # [complex]
                while n < maxiter and (… <4):    # [complex]
                    z = z * z + c                # [complex]
                    n += 1                       # [int]
                output[i] = n                    # [int]
            return output                        # [list(int)]

    Couldn't we generate Cython pyx?
    Runtime: 0.22s
13. Pythran (0.40)

        #pythran export calculate_z_serial_purepython(int, complex list, complex list)
        def calculate_z_serial_purepython(maxiter, zs, cs):
            …

    Support for OpenMP on numpy arrays
    Author Serge made an overnight fix: superb support!
    List runtime: 0.4s

        #pythran export calculate_z(int, complex[], complex[], int[])
        …
        #omp parallel for schedule(dynamic)

    OMP numpy runtime: 0.10s
14. PyPy nightly (and numpypy)
    ● “It just works” on Python 2.7 code
    ● Clever list strategies (e.g. unboxed, uniform)
    ● Little support for pre-existing C extensions (e.g. the existing numpy)
    ● multiprocessing, IPython etc. all work fine
    ● Python list code runtime: 0.3s
    ● (pypy)numpy support is incomplete, bugs are tackled (numpy runtime 5s [CPython+numpy 56s])
15. Numba 0.12

        from numba import jit

        @jit(nopython=True)
        def calculate_z_serial_purepython(maxiter, zs, cs, output):
            # couldn't create output, had to pass it in
            # output = numpy.zeros(len(zs), dtype=np.int32)
            for i in xrange(len(zs)):
                n = 0
                z = zs[i]
                c = cs[i]
                #while n < maxiter and abs(z) < 2:  # abs unrecognised
                while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
                    z = z * z + c
                    n += 1
                output[i] = n
            #return output

    Runtime: 0.4s
    Some Python 3 support, some GPU prange support missing (was in 0.11)?
    0.12 introduces temporary limitations
16. Tool Tradeoffs
    ● PyPy: no learning curve (pure Py only), easy win?
    ● ShedSkin: easy (pure Py only) but fairly rare
    ● Cython pure Py: hours to learn, team cost low (and lots of online help)
    ● Cython numpy OMP: days+ to learn, heavy team cost?
    ● Numba/Pythran: hours to learn, install a bit tricky (Anaconda easiest for Numba)
    ● Pythran OMP: very impressive result for little effort
    ● Numba: big toolchain which might hurt productivity?
    ● (numexpr not covered: great for numpy and easy to use)
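
To weigh these tradeoffs against the measured gains, the runtimes quoted on the earlier slides can be put side by side. The sketch below only re-uses those quoted figures and divides the 12s pure CPython baseline by each one. The labels are shorthand, and the PyPy numpy figure (5s against 56s for CPython+numpy) is left out because it refers to a different baseline.

```python
BASELINE_SECONDS = 12.0  # pure CPython lists code, as quoted on the Cython slide

# Runtimes in seconds as quoted on the individual tool slides.
runtimes = {
    "Cython lists": 0.19,
    "Cython numpy": 0.16,
    "Cython + numpy + OMP": 0.05,
    "ShedSkin lists": 0.22,
    "Pythran lists": 0.4,
    "Pythran OMP numpy": 0.10,
    "PyPy lists": 0.3,
    "Numba": 0.4,
}

# Speedup factor relative to the pure CPython baseline.
speedups = {tool: BASELINE_SECONDS / seconds for tool, seconds in runtimes.items()}

for tool, factor in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tool:<22} {runtimes[tool]:>5.2f}s  {factor:>6.1f}x")
```

Every option is at least 30 times faster than the baseline on this benchmark, so the choice between them rests largely on the learning and team costs listed above.
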
17. Wrap up
    ● Our profiling options should be richer
    ● 4-12 physical CPU cores commonplace
    ● Cost of hand-annotating code is reduced agility
    ● JITs/AST compilers are getting fairly good, manual intervention still gives best results
    BUT! CONSIDER:
    ● Automation should (probably) be embraced (\$CPUs < \$humans) as team velocity is probably higher
18. Thank You
    • Ian@IanOzsvald.com
    • @IanOzsvald
    • MorConsulting.com
    • Annotate.io
    • GitHub/IanOzsvald