Standardizing on a single N-dimensional array API for Python

Ralf Gommers
MXNet workshop, 14 Dec 2020

Array-based computing in Python

Today’s Python data ecosystem
Can we make it easy to build on top of multiple array data structures?

State of compatibility today
All libraries have common concepts and functionality.
But, there are many small (and some large) incompatibilities. It’s very painful to
translate code from one array library to another.
Let’s look at some examples!

Consortium for Python Data API Standards
A new organization, with participation from maintainers of many array (or
tensor) and dataframe libraries.
Concrete goals for first year:
1. Define a standardization methodology and necessary tooling for it
2. Publish an RFC for an array API standard
3. Publish an RFC for a dataframe API standard
4. Finalize 2021.0x API standards after community review
See data-apis.org and github.com/data-apis for more on the Consortium

Goals for and scope of the array API
Goal 1: enable writing code & packages that support multiple array libraries
Goal 2: make it easy for end users to switch between array libraries
In scope:
● Syntax and semantics of functions and objects in the API
● Casting rules, broadcasting, indexing, Python operator support
● Data interchange & device support
Out of scope:
● Execution semantics (e.g. task scheduling, parallelism, lazy eval)
● Non-standard dtypes, masked arrays, I/O, subclassing array object, C API
● Error handling & behaviour for invalid inputs to functions and methods

Array- and array-consuming libraries
Data interchange between array libs
Using DLPack, this will work for any two libraries if both support the device the
data resides on:
x = xp.from_dlpack(x_other)
Portable code in array-consuming libs
def softmax(x):
    # grab standard namespace from
    # the passed-in array
    xp = get_array_api(x)
    x_exp = xp.exp(x)
    partition = xp.sum(x_exp, axis=1, keepdims=True)
    return x_exp / partition
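The softmax function above can be tried out with the minimal sketch below. The slide does not define get_array_api, so a hypothetical stand-in, get_namespace, is used here: it calls the array's __array_namespace__ method when the array provides one and otherwise falls back to NumPy. That fallback is an assumption for illustration, not part of the standard. The last lines show DLPack interchange with NumPy as both producer and consumer, guarded because older NumPy releases lack from_dlpack.

```python
import numpy as np


def get_namespace(x):
    """Hypothetical stand-in for the slide's get_array_api()."""
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    # Assumption: arrays without the method are treated as NumPy arrays.
    return np


def softmax(x):
    # Same logic as on the slide, using only names looked up on `xp`.
    xp = get_namespace(x)
    x_exp = xp.exp(x)
    partition = xp.sum(x_exp, axis=1, keepdims=True)
    return x_exp / partition


x = np.asarray([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]])
probs = softmax(x)
print(probs.sum(axis=1))  # each row of probabilities sums to ~1

# DLPack interchange; guarded because older NumPy lacks from_dlpack.
if hasattr(np, "from_dlpack"):
    x_view = np.from_dlpack(x)  # no copy: shares memory with x
else:
    x_view = x
```

Because softmax only touches names it looks up on xp, the same function body should work unchanged for any array object whose library exposes a compliant namespace.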

What does the full API surface look like?
● 1 array object with
○ 6 attributes: ndim, shape, size, dtype, device, T
○ dunder methods to support all Python operators
○ __array_api_version__, __array_namespace__, __dlpack__
● 11 dtype literals: bool, (u)int8/16/32/64, float32/64
● 1 device object
● 4 constants: inf, nan, pi, e
● ~115 functions:
○ Array creation & manipulation (18)
○ Element-wise math & logic (55)
○ Statistics (7)
○ Linear algebra (22)
○ Search, sort & set (7)
○ Utilities (4)
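One rough way to see how close an existing array object is to this surface is to probe for the six attributes listed above. The helper below, missing_attributes, is a hypothetical illustration and not part of the standard or of any library; the result for NumPy depends on the installed version.

```python
import numpy as np

# The six attributes named on the slide.
STANDARD_ATTRIBUTES = ("ndim", "shape", "size", "dtype", "device", "T")


def missing_attributes(x):
    """Return the standard attribute names that `x` does not provide."""
    return [name for name in STANDARD_ATTRIBUTES if not hasattr(x, name)]


x = np.ones((2, 3))
missing = missing_attributes(x)
print(missing)  # version-dependent; older NumPy arrays have no `device`
```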

Mutability & copies/views
x = ones(4)
# y may be a view on data of x
y = x[:2]
# modifies x if y is a view
y += 1
Mutable operations and the concept of views are
important for strided in-memory array implementations
(NumPy, CuPy, PyTorch, MXNet)
They are problematic for libraries based on immutable data
structures or delayed evaluation (TensorFlow, JAX, Dask)
Decisions in API standard:
1. Support in-place operators
2. Support item and slice assignment
3. Do not support out= keyword
4. Warn users that mixing mutating operations and views
may result in implementation-specific behavior
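A minimal sketch of the view behaviour, using NumPy as one example of a strided in-memory implementation. The second half shows an out-of-place alternative that does not rely on view semantics and should therefore behave the same way in libraries without views.

```python
import numpy as np

# In NumPy, basic slicing returns a view on the same memory.
x = np.ones(4)
y = x[:2]
y += 1          # in-place update through the view also changes x
print(x)        # first two elements are now 2.0

# Out-of-place alternative: the result is a new array, so the
# original is left untouched regardless of view semantics.
a = np.ones(4)
b = a[:2] + 1
print(a)        # still all ones
```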

Dtype casting rules
x = xp.arange(5) # will be integer
y = xp.ones(5, dtype=xp.float32)
# This may give float32, float64, or raise
dtype = (x * y).dtype
Casting rules are straightforward to align between
libraries when the dtypes are of the same kind.
Mixed integer and floating-point casting is very
inconsistent between libraries, and hard to change.
Hence this will remain unspecified.
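A short sketch of what this looks like in NumPy specifically. Other libraries may give a different result for the mixed case, which is the point of the slide. Casting explicitly before mixing kinds is one way to keep code portable; astype as an array method is NumPy's spelling and is used here as an assumption, not as the standard's API.

```python
import numpy as np

x = np.arange(5)                      # integer dtype
y = np.ones(5, dtype=np.float32)

# Mixed kinds: NumPy promotes integer * float32 to float64.
mixed = (x * y).dtype

# Same kind: promotion simply picks the wider dtype.
same_kind_int = (np.ones(3, dtype=np.int8) + np.ones(3, dtype=np.int16)).dtype
same_kind_float = (y + np.ones(5, dtype=np.float64)).dtype

# Portable approach: cast explicitly so no mixed-kind promotion happens.
explicit = (x.astype(np.float32) * y).dtype

print(mixed, same_kind_int, same_kind_float, explicit)
```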

Data-dependent output shape/dtype
# Boolean indexing, and even slicing
# in some cases, results in shapes
# that depend on values in `x`
x2 = x[:, x > 3]
val = somefunc(x)
x3 = x[:val]
# Functions for which output shape
# depends on value
unique(x)
nonzero(x)
# NumPy does value-based casting
x = np.ones(3, dtype=np.float32)
x + 1 # float32 output
x + 100000 # float64 output
Data-dependent output shapes or dtypes are
problematic, because of:
● static memory allocation (TensorFlow, JAX)
● graph-based scheduling (Dask)
● JIT compilation (Numba, PyTorch, JAX, Gluon)
Value-based dtype results can be avoided.
Value-based shapes can be important - the API
standard will include but clearly mark such
functionality.
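The shape dependence can be seen directly in NumPy. The sketch below also shows where() as one fixed-shape alternative to boolean indexing; whether it is an acceptable substitute depends on the use case.

```python
import numpy as np

x = np.asarray([0, 5, 2, 5, 0, 7])
mask = x > 3

# Output shapes below depend on the values in x, not just its shape.
selected = x[mask]          # boolean indexing
uniq = np.unique(x)         # distinct values
(idx,) = np.nonzero(x)      # indices of non-zero elements
print(selected.shape, uniq.shape, idx.shape)

# Fixed-shape alternative: keep every position, zero out the rest.
masked = np.where(mask, x, 0)
print(masked.shape)         # always the same shape as x
```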

Where are we now, and what’s next?
The array API standard is >90% complete and published for community review.
Still work-in-progress are:
● Data interchange with DLPack
● Device support
● Data-dependent shape handling
● A handful of regular functions (linalg, result_type, meshgrid)
Important next steps will be:
1. Complete the library-independent test suite
2. First (prototype) implementations in libraries
3. Get sign-off from maintainers of each array library
4. Define process to handle future & optional extensions

Thank you
Consortium:
● Website & introductory blog posts: data-apis.org
● Array API main repo: github.com/data-apis/array-api
● Latest version of the standard: data-apis.github.io/array-api/latest
● Members: github.com/data-apis/governance
Find me at: ralf.gommers@gmail.com, rgommers, ralfgommers
Try this at home - installing the latest version of all seven array libraries in one
env to experiment:
conda create -n many-libs python=3.7
conda activate many-libs
conda install cudatoolkit=10.2
pip install numpy torch jax jaxlib tensorflow mxnet cupy-cuda102 dask toolz sparse
