New Capabilities in the PyData Ecosystem

Peter Wang
Continuum Analytics
@pwang

Python & SciPy
• High performance linear algebra, image processing, optimization via NumPy, optimized C++, FORTRAN
• Large structured data via HDF5, memmap
• Out of core processing, streaming & realtime
• Distributed computing via MPI, IPython Parallel, etc.
• GPU & heterogeneous via OpenCL, PyCUDA, others
• Massive adoption in research, national labs, industry (engineering, finance, etc.)
• IPython Notebook: 2005-2011
• pandas: 2008-2009
• scikit-learn: 2007
• NumPy: 2006
• matplotlib: 2002
• IPython: 2001
• Numarray: 2001
• SciPy: 1999
• Numeric: 1995
Python has a >15-year history in scientific computing

"Python's Scientific Ecosystem"
@jakevdp

Focus On
• Bokeh
• Dask
• Blaze, odo
• dynd
• xray
Not Gonna Talk About...
• NumPy
• Pandas
• PyTables & h5py
• Beaker Notebook
• IPython widgets, JupyterHub
• conda, Anaconda Cluster
• Docker
• Docker
• Docker

Bokeh
• Interactive visualization
• Novel graphics
• Streaming, dynamic, large data
• For the browser, with or without a server
• No need to write Javascript
• Support for R, Scala, Julia, Lua
http://bokeh.pydata.org
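
As a rough illustration of "for the browser, with or without a server" and "no need to write Javascript", the sketch below builds a line plot in Python and renders it to a standalone HTML string. The data, title, and axis labels are made up for the example; it assumes a Bokeh installation that provides `bokeh.plotting.figure` and `bokeh.embed.file_html`.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Hypothetical data, chosen only for illustration.
x = [0, 1, 2, 3, 4]
y = [v * v for v in x]

# Build the plot entirely from Python.
p = figure(title="y = x^2", x_axis_label="x", y_axis_label="y")
p.line(x, y, line_width=2)

# Render a self-contained HTML document; BokehJS is loaded from the CDN,
# so no Bokeh server is needed to view the result.
html = file_html(p, CDN, "Bokeh sketch")
```

Writing `html` to a file and opening it in a browser should show an interactive plot with pan and zoom tools.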

Static Notebooks/HTML, Interactive Plots
http://nbviewer.ipython.org/github/bokeh/bokeh-notebooks/blob/master/tutorial/00%20-%20intro.ipynb#Interaction

Extensible Architecture
[Diagram: the bokeh.py object graph is exchanged as JSON with the BokehJS object graph in the browser, optionally via bokeh-server. Other labels: server.py, App Model]

rBokeh
http://hafen.github.io/rbokeh

Example: Ocean Temp Data
• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
• Every 1/4 degree, 720x1440 array each day

Bigger Data
36 years: 720 x 1440 x 12341 x 4 = 51 GB uncompressed
If you don't have this much RAM...
... better start chunking.
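
To make "chunking" concrete, here is a minimal sketch using NumPy only. It first reproduces the size arithmetic from the slide, then computes a mean over the time axis one slab at a time. The helper name `chunked_mean`, the chunk size, and the small stand-in array are assumptions for illustration; on real data the array would be an on-disk object (for example an HDF5 dataset or a memmap), so that each slice reads only one slab.

```python
import numpy as np

# Size check from the slide: 720 x 1440 grid, 12341 days, 4-byte floats.
n_lat, n_lon, n_days, itemsize = 720, 1440, 12341, 4
total_bytes = n_lat * n_lon * n_days * itemsize
size_gb = total_bytes / 1e9  # about 51 GB


def chunked_mean(days, chunk_size):
    """Mean over axis 0, touching only `chunk_size` time steps at once."""
    total = np.zeros(days.shape[1:], dtype=np.float64)
    for start in range(0, days.shape[0], chunk_size):
        block = days[start:start + chunk_size]  # one slab in memory
        total += block.sum(axis=0, dtype=np.float64)
    return total / days.shape[0]


# Small stand-in array (hypothetical shape) so the sketch runs anywhere.
rng = np.random.default_rng(0)
small = rng.random((50, 8, 16), dtype=np.float32)
result = chunked_mean(small, chunk_size=7)
```

The running sum is the only state carried between slabs, so peak memory depends on the chunk size rather than on the number of days.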

Dask: Out of Core Scheduler for Python
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python

Core Ideas
• Dynamic task scheduling yields sane parallelism
• Simple library to enable parallelism
• Dask.array/dataframe to encapsulate the functionality
• Distributed scheduler coming
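
The task-scheduling idea can be sketched in plain Python. Dask represents a computation as a dictionary that maps keys to either data or tasks, where a task is a tuple of a callable followed by its arguments. The tiny evaluator below (the name `get` and the example graph are illustrative assumptions, not Dask's scheduler) walks such a graph sequentially; a real scheduler would also run independent tasks in parallel and release intermediates.

```python
from operator import add, mul


def get(graph, key, cache=None):
    """Evaluate `key` in a dask-style graph, resolving dependencies first."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # String arguments that name other keys are dependencies.
        resolved = [
            get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        value = func(*resolved)
    else:
        value = task  # literal data stored directly in the graph
    cache[key] = value
    return value


# Hypothetical graph: result = (x + y) * 10
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "result": (mul, "sum", 10),
}
answer = get(graph, "result")  # (1 + 2) * 10 = 30
```

Because the graph is ordinary data, a scheduler can inspect it before running anything, which is what lets it decide ordering and parallelism dynamically.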

dask.array: OOC, parallel, ND array
Arithmetic: +, *, ...
Reductions: mean, max, ...
Slicing: x[10:, 100:50:-2]
Fancy indexing: x[:, [3, 1, 2]]
Some linear algebra: tensordot, qr, svd
Parallel algorithms (approximate quantiles, topk, ...)
Slightly overlapping arrays
Integration with HDF5
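
A short sketch of a few of the operations listed above, assuming `dask` and NumPy are installed. The array contents and chunk shape are arbitrary choices for illustration; with real out-of-core data the input would typically be an on-disk array rather than an in-memory one.

```python
import dask.array as da
import numpy as np

# In-memory stand-in; chunks=(250, 250) splits it into a 4 x 4 grid of blocks.
data = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)
x = da.from_array(data, chunks=(250, 250))

# Everything below only builds a task graph; nothing is computed yet.
y = (x + 1) * 2             # arithmetic
col_mean = y.mean(axis=0)   # reduction
sliced = x[10:, 100:50:-2]  # slicing, including a negative step
fancy = x[:, [3, 1, 2]]     # fancy indexing

# .compute() runs the graph block by block and returns NumPy arrays.
mean_value = col_mean.compute()
fancy_value = fancy.compute()
```

Shapes such as `sliced.shape` are available without computing, since they follow from the chunk metadata alone.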

dask.dataframe: OOC, parallel dataframe
Elementwise operations: df.x + df.y
Row-wise selections: df[df.x > 0]
Aggregations: df.x.max()
groupby-aggregate: df.groupby(df.x).y.max()
Value counts: df.x.value_counts()
Drop duplicates: df.x.drop_duplicates()
Join on index: dd.merge(df1, df2, left_index=True, right_index=True)

More Complex Graphs
cross validation

http://continuum.io/blog/xray-dask

PyData's Future
• Dozens of international meetup groups
• International conferences each year, including collaborations with EuroPython, Strata, and others
• More companies investing in the ecosystem
• Dato - SFrame, SGraph, ...
• Cloudera - Impyla, Ibis, ...
• Microsoft - Python in AzureML
• Databricks - PySpark
• Continuum - *.*
