Travis E. Oliphant, "NumPy and SciPy: History and Ideas for the Future"

NumPy and SciPy: History
and Ideas for the Future
Travis E. Oliphant

SciPy Tokyo. March 18, 2012

Saturday, March 17, 12

My Roots


My Roots
Images from BYU Mers Lab


Science led to Python
2
⇢0 (2⇡f ) Ui (a, f ) = [Cijkl (a, f ) Uk,l (a, f )],j

Raja Muthupillai

Richard Ehman
1997

Armando Manduca


Finding derivatives of 5-d data
⌅=r⇥U


Scientist at heart


Python origins. http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html

Version Date
0.9.0 Feb. 1991
0.9.4 Dec. 1991
0.9.6 Apr. 1992
0.9.8 Jan. 1993
1.0.0 Jan. 1994
1.2 Apr. 1995
1.4 Oct. 1996
1.5.2 Apr. 1999


Brief History

Person Package Year
Matrix Object
Jim Fulton 1994
in Python
Jim Hugunin Numeric 1995
Perry Greenﬁeld, Rick
White, Todd Miller Numarray 2001

Travis Oliphant NumPy 2005


Found Python and Numeric in 1997
I was a fairly proficient MATLAB user, but it was not memory efficient enough.

Loved the expressive syntax of Python
Loved the fact that slicing didn’t make copies
Loved the existing multiple data-types
Loved how much more flexible it was to
extend than MATLAB was
Loved that I could read the source code and
extend it


First problem: Efﬁcient Data Input
“It’s All About the Data”

Reference Counting Essay
TableIO http://www.python.org/doc/essays/refcnt/
April 1998 May 1998

Michael A. Miller Guido van Rossum

NumPyIO
June 1998


Early pieces of SciPy
fftw wrappers cephesmodule
June 1998 November 1998

stats.py
December 1998
Gary
Strangman


1999 : Early SciPy emerges
Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis
environment: Paul Barrett, Joe Harrington, Perry Greenﬁeld, Paul Dubois, Konrad Hinsen,
and others. Activity in 1998, led to increased interest in 1999.

In response on 15 Jan, 1999, I posted to matrix-sig a list of routines I felt needed to be
present and began wrapping / writing in earnest. On 6 April 1999, I announced I would
be creating this uber-package which eventually became SciPy

Gaussian quadrature 5 Jan 1999
cephes 1.0 30 Jan 1999
sigtools 0.40 23 Feb 1999
Numeric docs March 1999
cephes 1.1 9 Mar 1999 Plotting??
multipack 0.3 13 Apr 1999
Helper routines 14 Apr 1999 Gist
multipack 0.6 (leastsq, ode, fsolve, 29 Apr 1999 XPLOT
quad)
DISLIN
sparse plan described 30 May 1999
Gnuplot
multipack 0.7 14 Jun 1999
SparsePy 0.1
cephes 1.2 (vectorize)
5 Nov 1999
29 Dec 1999
Helping with f2py


Early 2000 : Numeric needs

• Memory mapped arrays
• Rank-0 arrays or scalars
• Handling indirect indexing: a[[10,5,7]]
• Handling masked indexing: a[[True, False, False]
• More attributes to N-d arrays
• “Record arrays”


Early contributors in 1999
Hosting of first Multipack CVS repository (June 1999)
Amazing makefiles
Interface to FITPACK
Wrote f2py as he watched my brute-force approach (July 1999)
Pearu Peterson

(IPP) Early IPython interactive environment (27 Apr 1999)
Matlab file reader (24 Apr 1999)

Janko Hauser

Created windows binaries of multipack, cephesmodule,
fftw, and signaltools (June 1999 while still in high school!)

Robert Kern


2000-2001


SciPy 2001 Travis Oliphant
optimize
sparse
interpolate
integrate
special
signal
stats Founded in 2001 with Travis Vaught
fftpack
misc

Eric Jones
weave
cluster
Pearu Peterson
GA*
linalg
interpolate
f2py


Finally... Plotting

John Hunter
2001


IPython (building on IPP)
Fernando Perez
Dec 10, 2001


STSCI leads out with Numarray

Perry Greenﬁeld
J. Todd Miller
Rick White
Paul Barrett


Numarray released in 2003

• too slow for small arrays
• incomplete implementation for ufuncs
• minimal Numeric code re-use

• lots of very nice things, though (e.g. memory
maps, fast code for large arrays, better
sorting algorithms)


Split in the community

Numeric
SciPy

Numarray
ndimage
others


ndimage


started January 2005

NumPy
Version 1.0 October 2006

Key contributions from:
Numarray
Numeric
Chuck Harris
Robert Kern
David Cooke
Pierre GM


NumPy in Python
Q: When is NumPy going to be part of the
core Python?

A: Data structure is in Python as PEP 3118


Left Academica For Industry


Community effort many, many others --- forgive me!
• Chuck Harris
• Pauli Virtanen
• David Cournapeau
• Stefan van der Walt
• Jarrod Millman
• Josef Perktold
• Anne Archibald
• Dag Sverre Seljebotn
• Robert Kern
• Matthew Brett
• Warren Weckesser
• Ralf Gommers
• Joe Harrington --- Documentation effort
• Andrew Straw --- www.scipy.org


NumPy Array

shape


NumPy Essentials
• Data: the array object
– slicing
– shapes and strides
– data-type generality

• Fast Math:
– vectorization
– broadcasting
– aggregations


Zen of NumPy
• strided is better than scattered
• contiguous is better than strided
• descriptive is better than imperative
• array-oriented is better than object-oriented
• broadcasting is a great idea
• vectorized is better than an explicit loop
• unless it’s too complicated --- then use Cython / weave
• think in higher dimensions


Now What?


Experiences Conulsting
• Basically, I was a consultant and manager of
consultants for 4 years.
• It kept the lights on at home, but gave me minimal
time for NumPy.
• Silver lining was that I learned exactly what “big-data”
problems big companies have and saw first-hand that
numerous “little” improvements need to be made to
NumPy which will allow it to have a larger impact on
helping to manage the world’s data.


Putting Science back in CS
• Much of the software stack is for systems
programming --- C++, Java, .NET, ObjC, web
- Complex numbers?
- Vectorization primitives?
• Array-based programming has been
supplanted by Object-oriented programming
• Software stack for scientists is not as helpful
as it should be
• Fortran is still where many scientists end up


APL : the ﬁrst array-oriented language
• Appeared in 1964
• Originated by Ken Iverson
• Direct descendants (J, K, Matlab) are still
used heavily and people pay a lot of money
for them APL
• NumPy is a descendent J

K Matlab
Numeric
NumPy


Conway’s game of Life
• Dead cell with exactly 3 live neighbors
will come to life
• A live cell with 2 or 3 neighbors will
survive
• With too few or too many neighbors, the
cell dies


Interesting Patterns emerge


Conway’s Game of Life
APL

NumPy
Initialization

Update Step


Improvements needed
• NDArray improvements
• Indexes (esp. for Structured arrays)
• SQL front-end
• Multi-level, hierarchical labels
• selection via mappings (labeled arrays)
• Memory spaces (array made up of regions)
• Distributed arrays (global array)
• Compressed arrays
• Standard distributed persistance
• fancy indexing as view and optimizations
• streaming arrays


Improvements needed
• Dtype improvements
• Enumerated types (including dynamic enumeration)
• Derived fields
• Specification as a class (or JSON)
• Pointer dtype (i.e. C++ object, or varchar)
• Finishing datetime
• Missing data with both bit-patterns and mask
• Parameterized field names


Example of Object Dtype

@np.dtype
class Stock(np.DType):
symbol = np.Str(4)
open = np.Int(2)
close = np.Int(2)
high = np.Int(2)
low = np.Int(2)
@np.Int(2)
def mid(self):
return (self.high + self.low) / 2.0


Improvements needed
• Ufunc improvements
• Generalized ufuncs support more than just
contiguous arrays
• Specification of ufuncs in Python
• Move most dtype “array functions” to ufuncs
• Unify error-handling for all computations
• Allow lazy-evaluation and remote computation ---
streaming and generator data
• Structured and string dtype ufuncs
• Multi-core and GPU optimized ufuncs
• Group-by reduction


Improvements needed
• Miscellaneous improvements
• ABI-management
• Eventual Move to C++ library (NDLib)
• NDLib could serve as base for Javascript and other
high-level languages
• Integration with LLVM
• Possible dtype / shape / stride unification into a
“dimension protocol”
• Remote computation
• Fast I/O for CSV and Excel


NumPy Users

• Want to be able to write Python to get fast
code that works on arrays and scalars
• Need access to a boat-load of C-extensions
(NumPy is just the beginning)

PyPy doesn’t cut it for us!


Ufuncs

Generalized
UFuncs
Python
Function
Window
Kernel
Funcs

Function-
based
Indexing

Memory
Dynamic compilation

Filters
Dynamic
Compilation

NumPy Runtime
I/O Filters

Reduction
Filters

Computed
Columns
function pointer

SciPy needs a Python compiler

optimize integrate

special ode

writing more of SciPy at high-level


Numba -- a Python compiler

• Replays byte-code on a stack with simple type-
inference
• Translates to LLVM (using LLVM-py)
• Uses LLVM for code-gen
• Resulting C-level function-pointer can be
inserted into NumPy run-time
• Understands NumPy arrays
• Is NumPy / SciPy aware


NumPy + Mamba = Numba
Python Function Machine Code

LLVM-PY

LLVM 3.1
ISPC OpenCL OpenMP CUDA CLANG

Intel AMD Nvidia Apple


Examples


Software Stack Future?
Plateaus of Code re-use + DSLs
SQL R
TDPL Matlab

Python

OBJC C
FORTRAN C++

LLVM


Seeking Developers!

https://github.com/ContinuumIO/numba


How to pay for all this?


Dual strategy

NumPy 2.0


NumFOCUS
Num(Py) Foundation for Open Code for Usable Science


NumFOCUS

• Mission
• To initiate and support educational programs
furthering the use of open source software in
science.
• To promote the use of high-level languages and
open source in science, engineering, and math
research
• To encourage reproducible scientific research


NumFOCUS

• Activites
• Sponsor sprints and conferences
• Provide scholarships and grants
• Pay for documentation development and basic
course development
• Work with domain-specific organizations
• Raise funds from industries using Python and
NumPy


NumFOCUS

• Directors
• Jarrod Millman
• Fernando Perez
• Travis Oliphant
• Perry Greenfield
• John Hunter
• Members
• Basically people who donate for now. In time, a
body that elects directors.


Travis E. Oliphant, "NumPy and SciPy: History and Ideas for the Future"

More Related Content

What's hot

Similar to Travis E. Oliphant, "NumPy and SciPy: History and Ideas for the Future"

More from Shohei Hido

Recently uploaded

Travis E. Oliphant, "NumPy and SciPy: History and Ideas for the Future"