New Capabilities in the PyData Ecosystem
Peter Wang
Continuum Analytics
@pwang
Data Science @ NYT
Python & SciPy
• High performance linear algebra, image processing,
optimization via NumPy, optimized C++, FORTRAN
• Large structured data via HDF5, memmap
• Out of core processing, streaming & realtime
• Distributed computing via MPI, IPython Parallel, etc.
• GPU & heterogeneous via OpenCL, PyCUDA, others
• Massive adoption in research, national labs, industry
(engineering, finance, etc.)
• IPython Notebook: 2005-2011
• pandas: 2008-2009
• scikit-learn: 2007
• NumPy: 2006
• matplotlib: 2002
• IPython: 2001
• Numarray: 2001
• SciPy: 1999
• Numeric: 1995
Python has a >15-year history in scientific computing
"Python's Scientific Ecosystem"
@jakevdp
"Many More Tools"
@jakevdp
Focus On
• Bokeh
• Dask
Not Gonna Talk About...
• Blaze, odo
• dynd
• xray
• NumPy
• Pandas
• PyTables & h5py
• Beaker Notebook
• IPython widgets, JupyterHub
• conda, Anaconda Cluster
• Docker
• Docker
• Docker
Bokeh
• Interactive visualization
• Novel graphics
• Streaming, dynamic, large data
• For the browser, with or without a server
• No need to write JavaScript
• Support for R, Scala, Julia, Lua
http://bokeh.pydata.org
Dashboards & Data Apps
Static Notebooks/HTML, Interactive Plots
http://nbviewer.ipython.org/github/bokeh/bokeh-notebooks/blob/master/tutorial/00%20-%20intro.ipynb#Interaction
Extensible Architecture
[Diagram labels: bokeh.py object graph, JSON, BokehJS object graph, bokeh-server, server.py, Browser, App Model]
rBokeh
http://hafen.github.io/rbokeh
Dask
Example: Ocean Temp Data
• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
• 1/4-degree grid: one 720 x 1440 array per day
Bigger Data
36 years: 720 x 1440 x 12341 x 4 = 51 GB uncompressed
If you don't have this much RAM...
... better start chunking.
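The size figure above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the trailing "x 4" means 4 bytes per value (for example float32), and the 30-days-per-chunk split is a hypothetical choice for illustration, not something stated in the talk.

```python
# Size arithmetic from the slide: lat x lon x days x bytes per value.
# Assumption: the "x 4" factor is 4 bytes per value (e.g. float32).
LAT, LON, DAYS, BYTES_PER_VALUE = 720, 1440, 12341, 4

total_bytes = LAT * LON * DAYS * BYTES_PER_VALUE
total_gb = total_bytes / 1e9  # decimal gigabytes

# Hypothetical chunking choice (not from the talk): 30 days per chunk.
DAYS_PER_CHUNK = 30
chunk_bytes = LAT * LON * DAYS_PER_CHUNK * BYTES_PER_VALUE
n_chunks = -(-DAYS // DAYS_PER_CHUNK)  # ceiling division

print(f"full array: {total_gb:.1f} GB")
print(f"one chunk:  {chunk_bytes / 1e6:.1f} MB across {n_chunks} chunks")
```

With this split, only one chunk of roughly 124 MB needs to be in memory at a time, rather than the full 51 GB.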
DAG of Computation
Dask: Out of Core Scheduler for Python
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python
Core Ideas
• Dynamic task scheduling yields sane parallelism
• Simple library to enable parallelism
• Dask.array/dataframe to encapsulate the functionality
• Distributed scheduler coming
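To make "blocked algorithms and task scheduling" concrete, here is a toy sketch of a task graph written as a plain dict, where each task is a tuple of a function followed by its arguments. The helper names `get` and `resolve` are hypothetical and written for this illustration; the evaluator is single-threaded and recursive, and is not Dask's actual scheduler.

```python
def resolve(graph, arg, cache):
    """Turn an argument into a value: lists are resolved element by
    element, graph keys are evaluated, anything else is a literal."""
    if isinstance(arg, list):
        return [resolve(graph, a, cache) for a in arg]
    if isinstance(arg, (str, tuple)) and arg in graph:
        return get(graph, arg, cache)
    return arg


def get(graph, key, cache=None):
    """Evaluate `key` in a dict-of-tasks graph (toy, single-threaded)."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, args = task[0], task[1:]
        result = func(*(resolve(graph, a, cache) for a in args))
    else:
        result = task  # plain data, e.g. one block of the array
    cache[key] = result
    return result


# A blocked sum: three data blocks, one partial sum per block,
# then a final task that combines the partial results.
graph = {
    ("x", 0): [1, 2, 3],
    ("x", 1): [4, 5, 6],
    ("x", 2): [7, 8],
    ("partial", 0): (sum, ("x", 0)),
    ("partial", 1): (sum, ("x", 1)),
    ("partial", 2): (sum, ("x", 2)),
    "total": (sum, [("partial", 0), ("partial", 1), ("partial", 2)]),
}

total = get(graph, "total")
print(total)  # 36
```

The three "partial" tasks do not depend on each other, which is the property a real scheduler can exploit to run them in parallel or one block at a time.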
Simple Architecture
Core Concepts
dask.array: OOC, parallel, ND array
Arithmetic: +, *, ...
Reductions: mean, max, ...
Slicing: x[10:, 100:50:-2]
Fancy indexing: x[:, [3, 1, 2]]
Some linear algebra: tensordot, qr, svd
Parallel algorithms (approximate quantiles, topk, ...)
Slightly overlapping arrays
Integration with HDF5
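As a rough illustration of how a reduction such as mean can run out of core, the sketch below keeps a running sum and count per block instead of loading everything at once. It uses plain Python lists and made-up values; `chunked` and `blocked_mean` are hypothetical helpers for this example, not part of dask.array.

```python
def chunked(values, size):
    """Yield successive blocks of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def blocked_mean(chunks):
    """Mean over all blocks, holding only one block at a time."""
    total, count = 0.0, 0
    for chunk in chunks:
        total += sum(chunk)   # per-block partial sum
        count += len(chunk)   # per-block element count
    return total / count


temps = [float(t) for t in range(10)]  # made-up readings 0.0 .. 9.0
result = blocked_mean(chunked(temps, 4))
print(result)  # 4.5
```

The same pattern (a small per-block summary, then a cheap combine step) applies to other reductions such as max.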
dask.dataframe: OOC, parallel DataFrame
Elementwise operations: df.x + df.y
Row-wise selections: df[df.x > 0]
Aggregations: df.x.max()
groupby-aggregate: df.groupby(df.x).y.max()
Value counts: df.x.value_counts()
Drop duplicates: df.x.drop_duplicates()
Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
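One way to picture a groupby-aggregate such as `df.groupby(df.x).y.max()` over data split into partitions is: aggregate within each partition, then combine the small per-partition results. The sketch below does this with plain dicts and invented rows; the function names are hypothetical and this is not how dask.dataframe is necessarily implemented.

```python
def partition_groupby_max(rows):
    """Per-partition step: max value for each key in one partition."""
    out = {}
    for key, value in rows:
        if key not in out or value > out[key]:
            out[key] = value
    return out


def combine_max(partials):
    """Combine step: merge per-partition maxima into global maxima."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            if key not in merged or value > merged[key]:
                merged[key] = value
    return merged


# Invented (x, y) rows, split into three partitions.
partitions = [
    [("a", 1), ("b", 5), ("a", 3)],
    [("b", 2), ("c", 7)],
    [("a", 9), ("c", 4)],
]

result = combine_max(partition_groupby_max(p) for p in partitions)
print(result)  # {'a': 9, 'b': 5, 'c': 7}
```

This works because max is associative: the maximum of per-partition maxima equals the maximum over all rows, so no partition ever needs to see another partition's raw data.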
More Complex Graphs
cross validation
http://continuum.io/blog/xray-dask
PyData's Future
• Dozens of international meetup groups
• International conferences each year, including collaborations with EuroPython, Strata, and others
• More companies investing in the ecosystem
• Dato - SFrame, SGraph, ...
• Cloudera - Impyla, Ibis, ...
• Microsoft - Python in AzureML
• Databricks - PySpark
• Continuum - *.*
