Compiling Python to Native Code for Speed and Scale
David Kammeyer
Continuum Analytics
kammeyer@continuum.io
Tuesday, June 4, 13
Continuum Background
• Python for Big Data and Science
• Founded by Travis Oliphant (Creator of NumPy) and Peter Wang in 2012
• 45 Employees
About Continuum Analytics
Areas: Enterprise Python, Scientific Computing, Data Processing, Data Analysis, Visualization, Scalable Computing
• Products
• Training
• Support
• Consulting
Products
Anaconda: Easy-to-install Python distribution, including the most popular open-source scientific and mathematical libraries. (Free!)
Accelerate: Opens up the full capabilities of the GPU or multi-core processor to Python.
IOPro: Fast loading of data from files, SQL, and NoSQL stores, improving performance and reducing memory overhead.
Wakari: Browser-based Python and Linux environment for collaborative data analysis, exploration, and visualization. (Small instance is free!)
Open Source Projects
Blaze: High-performance Python library for modern vector computing, distributed and streaming data
Bokeh: Interactive, grammar-based visualization system for large datasets
Numba: Vectorizing Python compiler for multicore and GPU, using LLVM
Numba
• Just-in-time, dynamic compiler for Python
• Optimize data-parallel computations at call time,
to take advantage of local hardware configuration
• Compatible with NumPy, Blaze
• Leverage LLVM ecosystem:
• Optimization passes
• Inter-op with other languages
• Variety of backends (e.g. CUDA for GPU support)
LLVM
[Diagram: C, C++, Fortran, and Python front ends compile to LLVM IR, which is lowered to x86, ARM, and PTX back ends]
• Leverage LLVM ecosystem:
• Optimization passes
• Inter-op with other languages
• Variety of backends (e.g. CUDA for GPU support)
Simple API
#@jit('void(double[:,:], double, double)')
@autojit
def numba_update(u, dx2, dy2):
    nx, ny = u.shape
    for i in xrange(1, nx-1):
        for j in xrange(1, ny-1):
            u[i,j] = (((u[i+1,j] + u[i-1,j]) * dy2 +
                       (u[i,j+1] + u[i,j-1]) * dx2) /
                      (2*(dx2+dy2)))

Comment out one of jit or autojit (don't use them together):
• jit --- provide type information (fastest to call at run-time)
• autojit --- detects input types, infers output, generates code if needed, and dispatches (a little more run-time call overhead)
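As a sanity check, the stencil above can be written with plain NumPy slicing, no Numba needed. One subtlety: the sliced form reads all old values before writing (a Jacobi sweep), while the in-place loop sees freshly updated neighbors (Gauss-Seidel), so the two can differ slightly. A minimal sketch:

```python
import numpy as np

def numpy_update(u, dx2, dy2):
    # Same stencil as numba_update, written with array slices:
    # each interior point becomes a weighted average of its
    # four neighbors. The right-hand side is fully evaluated
    # before assignment, so this is a pure Jacobi sweep.
    u[1:-1, 1:-1] = ((u[2:, 1:-1] + u[:-2, 1:-1]) * dy2 +
                     (u[1:-1, 2:] + u[1:-1, :-2]) * dx2) / (2 * (dx2 + dy2))

# Tiny 3x3 case: hot boundary (1.0) around a cold center (0.0).
u = np.zeros((3, 3))
u[0, :] = 1.0
u[-1, :] = 1.0
u[:, 0] = 1.0
u[:, -1] = 1.0
numpy_update(u, 1.0, 1.0)   # center becomes the neighbor average: 1.0
```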
Example
from numba import jit
from math import sin, pi

@jit('f8(f8)')
def sinc(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)
Compile NumPy array expressions
from numba import autojit

@autojit
def formula(a, b, c):
    a[1:,1:] = a[1:,1:] + b[1:,:-1] + c[1:,:-1]

@autojit
def express(m1, m2):
    m2[1:-1:2,0,...,::2] = (m1[1:-1:2,...,::2]
                            * m1[-2:1:-2,...,::2])
    return m2
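These kernels are ordinary Python, so they run unchanged without the decorator, which makes them easy to sanity-check in plain NumPy first. A quick check of the formula kernel on small arrays:

```python
import numpy as np

def formula_ref(a, b, c):
    # Plain-NumPy version of the @autojit formula above:
    # every element except the first row and column is updated
    # with the shifted views of b and c.
    a[1:, 1:] = a[1:, 1:] + b[1:, :-1] + c[1:, :-1]

a = np.ones((2, 2))
b = np.full((2, 2), 2.0)
c = np.full((2, 2), 3.0)
formula_ref(a, b, c)   # a[1,1] becomes 1 + 2 + 3 = 6; rest untouched
```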
Fast vectorize
NumPy's ufuncs take "kernels" and apply the kernel element-by-element over entire arrays. With vectorize, you write the kernels in Python!

from numbapro import vectorize
from math import sin, pi

@vectorize(['f8(f8)', 'f4(f4)'])
def sinc(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)
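The elementwise semantics of the generated ufunc can be sketched in plain NumPy (no Numba required); np.where evaluates both branches over the whole array and picks per element:

```python
import numpy as np

def sinc_reference(x):
    # Elementwise semantics of the vectorized kernel above,
    # written with plain NumPy (no compilation involved).
    x = np.asarray(x, dtype=np.float64)
    # Suppress the 0/0 warning from the branch not taken at x == 0.
    with np.errstate(divide='ignore', invalid='ignore'):
        out = np.where(x == 0.0, 1.0, np.sin(x * np.pi) / (np.pi * x))
    return out

vals = sinc_reference(np.array([0.0, 0.5, 1.0]))
# vals ~ [1.0, 2/pi, 0.0]
```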
Create parallel-for loops
The prange directive spawns compiled tasks in threads (like OpenMP's parallel-for pragma).

import numbapro
from numba import autojit, prange

@autojit
def parallel_sum2d(a):
    sum = 0.0
    for i in prange(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i,j]
    return sum
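The same reduction can be sketched without Numba: each outer iteration produces an independent row sum, which is why the i loop is safe to parallelize. A plain-Python version using a thread pool as a hypothetical stand-in for prange's scheduler (note that pure-Python threads gain nothing here because of the GIL, which is exactly why prange compiles the loop body first):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_sum2d_sketch(a):
    # Each row sum is independent, mirroring the prange outer loop;
    # the per-row partials are combined at the end (a reduction).
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda row: float(np.sum(row)), a)
    return sum(partials)

a = np.arange(12.0).reshape(3, 4)
total = parallel_sum2d_sketch(a)   # 0 + 1 + ... + 11 = 66.0
```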
Example: Mandelbrot, Vectorized

from numbapro import vectorize

sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

@vectorize([sig], target='gpu')
def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    x = tid % width
    y = tid / width
    real = min_x + x * pixel_size_x
    imag = min_y + y * pixel_size_y
    c = complex(real, imag)
    z = 0.0j
    for i in range(iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i
    return 255
Kind     Time (s)   Speed-up
Python   263.6      1.0x
CPU      2.639      100x
GPU      0.1676     1573x
(GPU: Tesla S2050)
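The escape-time kernel itself runs fine as ordinary Python; a minimal single-pixel version (dropping the thread-id to pixel-coordinate mapping) can be checked directly:

```python
def escape_iterations(c, iters=255):
    # Iterate z -> z*z + c and report the step at which |z| exceeds 2
    # (i.e. |z|^2 >= 4), exactly as in the kernel above; a result of
    # 255 means the point never escaped within the iteration budget.
    z = 0.0j
    for i in range(iters):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return i
    return 255

inside = escape_iterations(0 + 0j)    # origin never escapes -> 255
outside = escape_iterations(2 + 2j)   # far outside, escapes at step 0
```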
Many More Advanced Features!
• Extension classes (jit a class -- autojit coming soon!)
• Struct support (NumPy arrays can be structs)
• SSA -- can refer to local variables as different types
• Typed lists and typed dictionaries and sets coming soon!
• Calling ctypes and CFFI functions natively
• pycc (create stand-alone dynamic library and executable)
• pycc --python (create static extension module for Python)
Availability
• Core is Open Source
• github.com/numba/numba
• GPU Compilation and Parallelization available in Anaconda Accelerate, €100.
Questions?
http://continuum.io
kammeyer@continuum.io
