PyData Boston 2013

Continuum Analytics: What
we are doing and why
Travis Oliphant, PhD
Continuum Analytics, Inc

This talk will be about (mostly) the free and /
or open source software we are building.
Enterprise
Python
Scientiﬁc
Computing
Data Processing
Data Analysis
Visualisation
Scalable
Computing
• Products
• Training
• Support
• Consulting

Began operations in January of 2012 with 5 people
with big dreams
Python

We are big backers of NumFOCUS and
organizers of PyData
Spyder

Our Team
Scientists Developers
+5 contractors
+4 interns
+1 Part-time
Business

Big Picture
expertise
insights

NumPy and SciPy are quite successful
Thanks to large and diverse community around it
(Matplotlib, IPython, SymPy, Pandas, etc.)
I estimate 1.5 million to 2 million users
Only incremental improvements possible with
these projects at this point.
Thus, we needed to start new projects...

Related Open Source Projects
Blaze: High-performance Python library for modern
vector computing, distributed and streaming data
Numba:Vectorizing Python compiler for multicore
and GPU, using LLVM
Bokeh: Interactive, grammar-based visualization
system for large datasets
Common Thread: High-level, expressive language for
domain experts; innovative compilers & runtimes
for efﬁcient, powerful data transformation

Conda and Anaconda
• Cross-platform package management
• Multiple environments allows you to have multiple
versions of packages installed in system
• Easy app-deployment
• Taming open-source
Free for all users
Enterprise
support available!

Why Conda?
• Linux users stopped complaining about Python
deployment
• I made major mistakes in management of NumPy/SciPy:
• too much in SciPy (SciPy as distribution) --- scikits model is much
better (tighter libraries developed by smaller teams)
• gave in to community desire for binary ABI compatibility in
NumPy motivated by difﬁculty of reproducing Python install
• Need for a cross-platform way to install major Python
extensions (many with dependencies on large C or C++
libraries
• Python can’t be ubiquitous if people struggle to just get
it and then manage it.

Why conda
Information Technology

What is Conda
• Full package management (like yum or apt-get)
but cross-platform
• Control over environments (using hard-link
farms) --- better than virtual-env. virtualenv today is like
distutils and setuptools of several years ago (great at ﬁrst but will end up
hating it)
• Architected to be able to manage any packages
(R, Scala, Clojure, Haskell, Ruby, JS)
• SAT solver to manage dependencies
• User-deﬁnable repositories

New Features and Binstar
• Build command from recipe --- many recipes here: https://
github.com/ContinuumIO/conda-recipes
• Upload recipes to Binstar (last mile in binary package
hosting and deployment for any language).
• “binstar in beta” is the beta code
• Personal conda repositories --- https://conda.binstar.org/
travis
• Free Continuum Anaconda repo will be on binstar.org
• Private packages and behind-the-ﬁrewall satellites available
• *Free build queue on Linux (Mac and Windows coming
soon) for hosted conda recipes

Demo
create Python 3 environment with IPython and scipy
create new recipe from PyPI (yunomi)

Packaging and Distribution Solved
• conda and binstar solve most of the problems that
we have seen people encounter in managing
Python installations (especially in large-scale
institutions).
• they are supported solutions that can remove the
technology pain of managing Python
• some problems, though, are people

Anaconda (open)
Free enterprise-ready Python distribution of open-
source tools for large-scale data processing,
predictive analytics, and scientiﬁc computing

Anaconda Add-Ons (paid-for)
•Revolutionary Python to GPU compiler
•Extends Numba to take a subset of Python
to the GPU (program CUDA in Python)
•CUDA FFT / BLAS interfaces
Fast, memory-efﬁcient Python interface for
SQL databases, NoSQL stores,Amazon S3,
and large data ﬁles.
NumPy, SciPy, scikit-learn, NumExpr compiled
against Intel’s Math Kernel Library (MKL)

Why Numba?
•Python is too slow for loops
•Most people are not learning C/C++/Fortran today
•Cython is an improvment (but still verbose and
needs C-compiler)
•NVIDIA using LLVM for the GPU
•Many people working with large typed-containers
(NumPy arrays)
•We want to take high-level, tarray-oriented
expressions and compile it to fast code

NumPy + Mamba = Numba
LLVM Library
Intel Nvidia AppleAMD
OpenCLISPC CUDA CLANGOpenMP
LLVMPY
Python Function Machine Code
ARM

Numba
@jit('void(f8[:,:],f8[:,:],f8[:,:])')
def filter(image, filt, output):
M, N = image.shape
m, n = filt.shape
for i in range(m//2, M-m//2):
for j in range(n//2, N-n//2):
result = 0.0
for k in range(m):
for l in range(n):
result += image[i+k-m//2,j+l-n//2]*filt[k, l]
output[i,j] = result
~1500x speed-up

Numba changes the game!
LLVM IR
x86
C++
ARM
PTX
C
Fortran
Python
Numba turns (a subset of) Python into a
“compiled language” as fast as C (but much more
ﬂexible). You don’t have to reach for C/C++

Laplace Example
@jit('void(double[:,:], double, double)')
def numba_update(u, dx2, dy2):
nx, ny = u.shape
for i in xrange(1,nx-1):
for j in xrange(1, ny-1):
u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
(u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))
Adapted from http://www.scipy.org/PerformancePython
originally by Prabhu Ramachandran
@jit('void(double[:,:], double, double)')
def numbavec_update(u, dx2, dy2):
u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 +
(u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))

Results of Laplace example
Version Time Speed Up
NumPy 3.19 1.0
Numba 2.32 1.38
Vect. Numba 2.33 1.37
Cython 2.38 1.34
Weave 2.47 1.29
Numexpr 2.62 1.22
Fortran Loops 2.30 1.39
Vect. Fortran 1.50 2.13
https://github.com/teoliphant/speed.git

LLVMPy worth looking at
LLVM (via
LLVMPy)
has done
much heavy
lifting
LLVMPy =
Compilers for
everybody

What is wrong with NumPy?
• Dtype system is difﬁcult to extend
• Many Dtypes needed (missing data, enums, variable length
strings)
• Immediate mode creates huge temporaries (spawning
Numexpr)
• “Almost” an in-memory data-base comparable to SQL-
lite (missing indexes)
• Integration with sparse arrays
• Standard structure of arrays representation...
• Missing Multi-methods
• Optimization / Minimal support for multi-core / GPU

Now What?
After watching NumPy and SciPy get used all over
Wall Street and by many scientists / engineers in
industry --- what would we do differently?

New Project
Blaze
NumPy
Out of Core,
Distributed and Optimized
NumPy

Blaze Array or Table
Data Descriptor
Data Buffer
Index
Operation
NumPy BLZ
Persistent
Format
RDBMS
CSVData Stream

Blaze Deferred Arrays
+"
A" *"
B" C"
A + B*C
• Symbolic objects which build a graph
• Represents deferred computation
Usually what you have when
you have a Blaze Array

DataShape Type System



• A data description language
• A super-set of NumPy’s dtype
• Provides more ﬂexibility
Shape DType
DataShape

Blaze
Database
GPU Node
Array
Server
NFS
Array
Server
Array
Server
Blaze Client
Synthesized
Array/Table view
array+sql://
array://
ﬁle:// array://
Python REPL,
Scripts
Viz Data
Server
C, C++,
FORTRAN
JVM
languages

Progress
• Basic calculations work out-of-core (via Numba and
LLVM)
• Hard dependency on dynd and dynd-python (a dynamic
C++-only multi-dimensional library like NumPy but with
many improvements)
• Persistent arrays from BLZ
• Basic array-server functionality for layering over CSV
ﬁles
• 0.2 release in 1-2 weeks. 0.3 within a month after that
(ﬁrst usable release)

DARPA providing help
DARPA-BAA-12-38: XDATA
TA-1: Scalable analytics and data processing technology

TA-2: Visual user interface technology

Bokeh Plotting Library
• Interactive graphics for the web
• Designed for large datasets
• Designed for streaming data
• Native interface in Python
• Fast JavaScript component
• DARPA funded
• v0.1 release imminent

Reasons for Bokeh
1. Plotting must happen near the data too
2. Quick iteration is essential => interactive visualization
3. Interactive visualization on remote-data => use the browser
4. Almost all web plotting libraries are either:
1. Designed for javascript programmers
2. Designed to output static graphs
5. We designed Bokeh to be dynamic graphing in the web for
Python programmers
6. Will include “Abstract” or “synthetic” rendering (working on
Hadoop and Spark compatibility)

Wakari
• Browser-based data analysis and
visualization platform
• Wordpress /YouTube / Github
for data analysis
• Full Linux environment with
Anaconda Python
• Can be installed on internal
clusters & servers

Why Wakari?
• Data is too big to fit on your desktop
• You need compute power but don’t have easy access to a
large cluster (cloud is sitting there with lots of power)
• Configuration of software on a new system stinks
(especially a cluster).
• Collaborative Data Analytics --- you want to build a
complex technical workflow and then share it with others
easily (without requiring they do painful configuration to
see your results)
• IPython Notebook is awesome --- let’s share it (but we
also need the dependencies and data).

Wakari
• Free account has 512 MB RAM / 10 GB disk and shared
multi-core CPU
• Easily spin-up map-reduce (Disco and Hadoop clusters)
• Use IPython Parallel on many-nodes in the cloud
• Develop GUI apps (possibly in Anaconda) and publish
them easily to Wakari (based on full power of scientiﬁc
python --- complex technical workﬂows (IPython
notebook for now)

Continuum Data Explorer (CDX)
• Open Source
• Goal is interactivity
• Combination of IPython REPL, Bokeh, and tables
• Tight integration between GUI elements and REPL
• Current features
- Namespace viewer (mapped to IPython namespace)
- DataTable widget with group-by, computed columns, advanced-
ﬁlters
- Interactive Plots connected to tables

Conclusion
Projects circle around giving tools to experts
(occasional programmers or domain
experts) to enable them to move their
expertise to the data to get insights --- keep
data where it is and move high-level but
performant code)
Join us or ask how we can help you!

PyData Boston 2013

More Related Content

What's hot

Similar to PyData Boston 2013

Recently uploaded

PyData Boston 2013