Scaling Python to GPUs and CPUs
Stanford Stats 285
October 30, 2017
Travis E. Oliphant
President, Chief Data Scientist, Co-founder Anaconda, Inc.
1
My Background
2
• MS/BS degrees in Elec. Comp. Engineering
• PhD from Mayo Clinic in Biomedical Engineering
(Ultrasound and MRI)
• Creator and Developer of SciPy (1998-2009)
• Professor at BYU (2001-2007), Inverse Problems
• Creator and Developer of NumPy (2005-2012)
• Started Numba (2012)
• Founder of NumFOCUS / PyData
• Python Software Foundation Director (2012)
• Co-founder of Continuum Analytics => Anaconda, Inc.
• CEO (2012) => Chief Data Scientist (2017)
SciPy
3
Empower domain experts with high-level tools that exploit modern hardware
Array Oriented Computing
expertise
Why Array-oriented computing
4
• Express domain knowledge directly in arrays (tensors, matrices, vectors) --- easier to teach programming in the domain
• Can take advantage of parallelism and accelerators
• Array expressions (see the sketch after the figure below)
[Figure: the same data as six separate objects (each with Attr1, Attr2, Attr3) versus a single table with columns Attr1, Attr2, Attr3 and rows Object1 through Object6 --- the array-oriented layout.]
More reasons for array-oriented
5
• Today’s vector machines (and vector co-processors, or GPUs) were made for array-oriented computing.
• The software stack has just not caught up --- unfortunate because APL came out in 1963.
• There is a reason Fortran remains popular.
6
Python and in particular PyData is Growing
7
Python’s Scientific Stack
[Figure, built up across slides 7-10: the scientific Python stack, with packages such as Bokeh.]
11
Python’s Scientific Ecosystem
[Figure: the wider ecosystem --- Bokeh and many, many more.]
12
13
Bringing Technology Together
Data Science Workflow
14
[Figure: workflow diagram --- Getting Data, Understand Data (Notebooks, Exploratory Data Analysis and Viz), Models, Data Products (Reports, Microservices, Dashboards, Applications), Decisions and Actions, Understand World, and New Data feeding back in.]
Machine Learning Explosion
16
Scikit-Learn
TensorFlow
Keras
XGBoost
Theano
Lasagne
Caffe/Caffe2
Torch
MXNet / MinPy
Neon
CNTK
DAAL
Chainer
DyNet
Apache Singa
Shogun
https://github.com/josephmisiti/awesome-machine-learning#python-general-purpose
http://deeplearning.net/software_links/
http://scikit-learn.org/stable/related_projects.html
17
A Representation of Packages for ML
18
NumPy and Packages that Depend on It
19
pandas Depends on NumPy
(and other packages depend on pandas)
20
Caffe Depends on pandas and NumPy
Embrace Innovation Without Anarchy
21
From http://www.slideshare.net/RevolutionAnalytics/r-at-microsoft
Reproducibility
22
• Conda: a cross-platform and language-agnostic package and environment manager
• Conda Forge: a community-led collection of recipes, build infrastructure, and packages for conda
• Conda Environments: custom isolated software sandboxes to allow easy reproducibility and sharing of data-science work
• Anaconda Project: reproducible, executable project directories
• Language independent
• Platform independent
• No special privileges required
• No VMs or containers
• Enables:
- Reproducibility
- Collaboration
- Scaling
“conda – package everything”
23
Conda Sandboxing Technology
[Figure: conda managing multiple isolated environments side by side --- e.g. Env A (Python v2.7), Env B (Python v3.4, Pandas v0.18, Jupyter), Env C (R, R Essentials) --- with different NumPy (v1.10, v1.11) and Pandas (v0.16) versions sandboxed separately.]
24
$ anaconda-project run plot --show
conda install tensorflow
Basic Conda Usage
25
Install a package:            conda install sympy
List all installed packages:  conda list
Search for packages:          conda search llvm
Create a new environment:     conda create -n py3k python=3
Remove a package:             conda remove nose
Get help:                     conda install --help
Advanced Conda Usage
26
Install a package in an environment:        conda install -n py3k sympy
Update all packages:                        conda update --all
Export list of packages:                    conda list --export packages.txt
Install packages from an export:            conda install --file packages.txt
See package history:                        conda list --revisions
Revert to a revision:                       conda install --revision 23
Remove unused packages and cached tarballs: conda clean -pt
27
[Figure: from Development to Deployment --- conda eases rapid deployment.]
NumPy
28
Without NumPy
29

functions.py:

    from math import sin, pi

    def sinc(x):
        if x == 0:
            return 1.0
        else:
            pix = pi*x
            return sin(pix)/pix

    def step(x):
        if x > 0:
            return 1.0
        elif x < 0:
            return 0.0
        else:
            return 0.5

>>> import functions as f
>>> xval = [x/3.0 for x in range(-10, 10)]
>>> yval1 = [f.sinc(x) for x in xval]
>>> yval2 = [f.step(x) for x in xval]

Python is a great language but needed a way to operate quickly and cleanly over multi-dimensional arrays.
With NumPy
30

functions2.py:

    from numpy import sin, pi
    from numpy import vectorize
    import functions as f

    vsinc = vectorize(f.sinc)
    def sinc(x):
        pix = pi*x
        val = sin(pix)/pix
        val[x==0] = 1.0
        return val

    vstep = vectorize(f.step)
    def step(x):
        y = x*0.0
        y[x>0] = 1
        y[x==0] = 0.5
        return y

>>> import functions2 as f
>>> from numpy import *
>>> x = r_[-10:10]/3.0
>>> y1 = f.sinc(x)
>>> y2 = f.step(x)

Offers N-D array, element-by-element functions, and basic random numbers, linear algebra, and FFT capability for Python
http://numpy.org
Fiscally sponsored by NumFOCUS
NumPy: an Array Extension of Python
31
• Data: the array object
– slicing and shaping
– data-type map to Bytes
• Fast Math (ufuncs):
– vectorization
– broadcasting
– aggregations
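A small sketch of the two halves above in action (values are illustrative): a ufunc call, broadcasting, and an aggregation.

    import numpy as np

    x = np.arange(12.0).reshape(3, 4)   # data: a 3x4 array
    y = np.sin(x)                       # ufunc: element-wise, the loop runs in machine code
    z = x - x.mean(axis=0)              # broadcasting: (3, 4) minus a (4,) row of means
    total = z.sum()                     # aggregation over all elements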
NumPy Array
32
Key Attributes
• dtype
• shape
• ndim
• strides
• data
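A quick look at these attributes on a concrete array (a minimal sketch):

    import numpy as np

    a = np.zeros((3, 4), dtype=np.float64)
    print(a.dtype)     # float64 -- how each element's bytes are interpreted
    print(a.shape)     # (3, 4)
    print(a.ndim)      # 2
    print(a.strides)   # (32, 8) -- bytes to step per dimension (4*8, 1*8)
    print(a.data)      # the underlying memory buffer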
NumPy Examples
33
[Slide: code creating 2d and 3d arrays, with sample outputs such as [439 472 477], [217 205 261 222 245 238], and 9.98330639789 2.96677717122.]
NumPy Slicing (Selection)
34
(The examples below use a 6x6 array a with a[i, j] == 10*i + j; a construction is sketched after them.)

>>> a[0, 3:5]
array([3, 4])

>>> a[4:, 4:]
array([[44, 45],
       [54, 55]])

>>> a[:, 2]
array([ 2, 12, 22, 32, 42, 52])

>>> a[2::2, ::2]
array([[20, 22, 24],
       [40, 42, 44]])
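One way to build such an array (an assumption consistent with the outputs shown):

    import numpy as np

    # a[i, j] == 10*i + j, matching the slide's outputs
    a = np.arange(6)[:, None] * 10 + np.arange(6)
    print(a[2::2, ::2])   # -> [[20 22 24], [40 42 44]]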
Summary
35
• Provides a foundational N-dimensional array composed of homogeneous elements of a particular “dtype”
• The dtype of the elements is extensive (but difficult to extend)
• Arrays can be sliced and diced with simple syntax for easy manipulation and selection.
• Provides fast and powerful math, statistics, and linear algebra functions that operate over arrays.
• Utilities for sorting, reading, and writing data are also provided.
Scaling Up and Out with Numba and Dask
36
Scale Up vs Scale Out
37
[Figure: two axes, Scale Up (bigger nodes) vs Scale Out (more nodes):
• Big memory & many cores / GPU box --- Numba
• Many commodity nodes in a cluster --- Dask
• Best of both (e.g. GPU cluster) --- Dask with Numba]
Development
38

Name      Latest Release   Releases   GitHub Stars   Contributors   Downloads (trailing 12 months)
numba     0.35             92         2647           74             1.8m
dask      0.15.4           38         2001           112            1.2m
dask-ml   0.3.1            6          104            4

Numba:   http://numba.pydata.org   http://github.com/numba
Dask:    http://dask.pydata.org    http://github.com/dask
Dask-ml: http://dask-ml.readthedocs.io/en/latest/index.html   http://github.com/dask/dask-ml
Numba
39
Numba (compile Python to CPUs and GPUs)
40
conda install numba
[Figure: Numba's compilation pipeline --- a parsing frontend takes Python, lowers it to an intermediate representation (IR), and the LLVM code-generation backend emits machine code for x86, ARM, or PTX (GPU).]
41
Image Processing

@jit('void(f8[:,:],f8[:,:],f8[:,:])')
def filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2, j+l-n//2]*filt[k, l]
            output[i, j] = result

~1500x speed-up

Works with and does not replace the standard Python interpreter (all of your existing Python libraries are still available)
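A quick way to exercise this kernel (the array sizes and box filter are illustrative, not from the slide):

    import numpy as np

    image = np.random.rand(512, 512)
    filt = np.ones((5, 5)) / 25.0    # simple box blur
    output = np.zeros_like(image)
    filter(image, filt, output)      # the f8[:,:] signature was compiled eagerly at decoration time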
Numba Features
42

Example: Filter an array
43-44
[Slides: annotated code walking through array allocation, looping over ndarray x as an iterator, using numpy math functions, and returning a slice of the array --- with the Numba decorator applied (nopython=True not required), a 2.7x speedup over NumPy.]
NumPy UFuncs and GUFuncs
45
NumPy ufuncs (and gufuncs) are functions that operate “element-wise” (or “sub-dimension-wise”) across an array without an explicit loop. This implicit loop (which is in machine code) is at the core of why NumPy is fast.

Dispatch is done internally to a particular code segment based on the type of the array. It is a very powerful abstraction in the PyData stack.

Making new fast ufuncs used to be possible only in C --- painful! With numba.vectorize and numba.guvectorize it is now easy! The inner secrets of NumPy are now at your fingertips for you to make your own magic!
Simple Ufunc
46

from numba import vectorize
import numpy as np

@vectorize
def dot2(a, b, x, y):
    return a*x + b*y

>>> a, b, x, y = np.random.randn(4, 1000)
>>> z = a * x + b * y
>>> z2 = dot2(a, b, x, y)   # faster

Faster (especially as N grows) because it does not create temporaries: NumPy creates temporary arrays for intermediate results, while Numba creates a fast machine-code kernel from the Python template and calls it for every element in the arrays.
Generalized Ufunc
47

from numba import guvectorize
import numpy as np

@guvectorize(['void(f8[:], f8[:], f8[:])'], '(n),(n)->()')
def dot2(a, b, c):
    c[0] = a[0]*b[0] + a[1]*b[1]

>>> a, b = np.random.randn(10000, 2), np.random.randn(10000, 2)
>>> z1 = np.einsum('ij,ij->i', a, b)
>>> z2 = dot2(a, b)   # applies the kernel over the last dimension

3.8x faster

This can create quite a bit of computation with very little code. Numba creates a fast machine-code kernel from the Python template and calls it for every sub-array.
48
Example: Making a windowed compute filter

Perform a computation on a finite window of the input. For a linear system, this is an FIR filter --- what np.convolve or sp.signal.lfilter can do. But what if you want some arbitrary computation, like a windowed median filter?
49
Example: Making a windowed compute filter
[Slide: a hand-coded implementation next to a ufunc built for the kernel, which is faster for large arrays. Array-oriented! The ufunc version can now run easily on the GPU with target='cuda' and on many cores with target='parallel'.]
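The slide's code is an image; as a stand-in, here is a minimal sketch of such a windowed-median kernel with numba.guvectorize (the dummy-array trick for carrying the window length, and all names, are my own, not the slide's):

    import numpy as np
    from numba import guvectorize

    @guvectorize(['void(f8[:], f8[:], f8[:])'], '(n),(w)->(n)')
    def windowed_median(x, dummy, out):
        # dummy only carries the window length w; its values are ignored
        half = dummy.shape[0] // 2
        n = x.shape[0]
        for i in range(n):
            lo = 0 if i - half < 0 else i - half
            hi = n if i + half + 1 > n else i + half + 1
            seg = np.sort(x[lo:hi].copy())    # median of the window by sorting a copy
            out[i] = seg[seg.shape[0] // 2]

    x = np.random.randn(100_000)
    y = windowed_median(x, np.empty(7))       # window of length 7

A hand-coded Python loop does the same thing; the gufunc version wins for large arrays and can be retargeted at threads or CUDA.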
50
Example: Making a windowed compute filter
[Slide: the windowed filter example, continued.]

How to Use Numba
51
1. Create a realistic benchmark test case. (Do not use your unit tests as a benchmark!)
2. Run a profiler on your benchmark. (cProfile is a good choice.)
3. Identify hotspots that could potentially be compiled by Numba with a little refactoring. (See the online documentation.)
4. Apply @numba.jit, @numba.vectorize, and @numba.guvectorize as needed to critical functions. (Small rewrites may be needed to work around Numba limitations.)
5. Re-run the benchmark to check whether there was a performance improvement.
6. Use target=parallel to get access to multiple cores (or target=cuda if you have a GPU).
How Numba works
52

@jit
def do_math(a, b):
    …

>>> do_math(x, y)

[Figure: the Python function's bytecode, together with the argument types at the call site, flows through bytecode analysis → type inference → Numba IR → rewrite IR → lowering → LLVM IR → LLVM JIT → machine code, which is cached and then executed.]
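One visible consequence of this pipeline: the dispatcher compiles one specialization per argument-type signature. A small sketch:

    from numba import jit

    @jit(nopython=True)
    def do_math(a, b):
        return a * b + 1

    do_math(1, 2)               # triggers an int64 compilation
    do_math(1.0, 2.0)           # triggers a separate float64 compilation
    print(do_math.signatures)   # two compiled signatures, one per type combination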
7 things about Numba you may not know
53
1. Numba is 100% Open Source
2. Numba + Jupyter = Rapid CUDA Prototyping
3. Numba can compile for the CPU and the GPU at the same time
4. Numba makes array processing easy with @(gu)vectorize
5. Numba comes with a CUDA Simulator
6. You can send Numba functions over the network
7. Numba developers are working on a GPU DataFrame (pygdf)
Numba is quite popular!
54
A Numba mailing-list post reports experiments by a SciPy author who got a 2x speed-up by removing their Cython type annotations and surrounding the function with numba.jit (with a few minor changes needed to the code).

With Numba's ahead-of-time compilation, one can use Numba to create a library to ship to others (who then don't need to have Numba installed). This is not as clean as it could be.

SciPy (and NumPy) would look very different if Numba had existed 16 years ago when SciPy was getting started --- and the PyPy crowd would be happier.
Releasing the GIL
55
Only nopython-mode functions can release the GIL.

Releasing the GIL
56
2.8x speedup with 4 cores
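A minimal sketch of the pattern behind that speedup (sizes and the 4-way split are illustrative): compile with nopython=True and nogil=True, then run the kernel from a thread pool so the threads actually overlap.

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np
    from numba import jit

    @jit(nopython=True, nogil=True)   # nogil requires nopython mode
    def sum_sq(x):
        total = 0.0
        for i in range(x.shape[0]):
            total += x[i] * x[i]
        return total

    x = np.random.randn(4_000_000)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # each chunk runs without holding the GIL, so the threads truly run in parallel
        parts = list(pool.map(sum_sq, np.split(x, 4)))
    print(sum(parts))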
CUDA Python (in open-source Numba!)
57
CUDA development using Python syntax for optimal performance!
10-20x faster than CPU
You have to understand CUDA at least a little --- writing kernels that launch in parallel on the GPU.
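A minimal sketch of what writing such a kernel looks like with numba.cuda (requires a CUDA-capable GPU; sizes are illustrative):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_kernel(x, y, out):
        i = cuda.grid(1)      # this thread's global index
        if i < x.size:        # guard against the ragged last block
            out[i] = x[i] + y[i]

    n = 1_000_000
    x = np.arange(n, dtype=np.float32)
    y = 2.0 * x
    out = np.empty_like(x)

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block
    add_kernel[blocks, threads_per_block](x, y, out)   # launch; arrays are copied to/from the device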
Classic Example
58
Mandelbrot

from numba import jit

@jit
def mandel(x, y, max_iters):
    c = complex(x, y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters
    return 255
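A small driver that renders the set by calling mandel per pixel, along the lines of the classic Numba demo (the name create_fractal and the sizes are illustrative). With @jit on both functions the whole loop nest compiles:

    import numpy as np
    from numba import jit

    @jit
    def create_fractal(xmin, xmax, ymin, ymax, image, iters):
        h, w = image.shape
        for j in range(h):
            y = ymin + j * (ymax - ymin) / h
            for i in range(w):
                x = xmin + i * (xmax - xmin) / w
                image[j, i] = mandel(x, y, iters)

    image = np.zeros((500, 750), dtype=np.uint8)
    create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20)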
The Basics
59
Mandelbrot benchmark:
CPython                       1x
NumPy array-wide operations   13x
Numba (CPU)                   120x
Numba (NVidia Tesla K20c)     2100x
Other topics
60
CUDA Python — write general GPU kernels with Python
Device Arrays — manage memory transfer from host to GPU
Streaming — manage asynchronous and parallel GPU compute streams
CUDA Simulator in Python — to help debug your kernels
HSA Support — early support for HSA-based GPUs and APUs
Pyculib — access to cuFFT, cuBLAS, cuSPARSE, cuRAND, CUDA Sorting
https://github.com/ContinuumIO/gtc2017-numba
Dask
61
• Designed to parallelize the Python ecosystem
• Handles complex algorithms
• Co-developed with Pandas/SKLearn/Jupyter teams
• Familiar APIs for Python users
• Scales from multicore to 1000-node clusters
• Resilient, responsive, and real-time

• Parallelizes NumPy, Pandas, SKLearn
  • Satisfies a subset of these APIs
  • Uses these libraries internally
  • Co-developed with these teams

• Task scheduler supports custom algorithms
  • Parallelize existing code
  • Build novel real-time systems
  • Arbitrary task graphs with data dependencies
  • Same scalability

• High level: Scaling Pandas
  • Same Pandas look and feel
  • Uses Pandas under the hood
  • Scales nicely onto many machines
• Low level: Arbitrary task scheduling
  • Parallelize normal Python code
  • Build custom algorithms
  • React in real time
• Demo deployed with dask-kubernetes on Google Compute Engine
  • github.com/dask/dask-kubernetes
  • Demo video: https://www.youtube.com/watch?v=ods97a5Pzw0
Why do people choose Dask?
• Familiar with Python:
• Drop-in NumPy/Pandas/SKLearn APIs
• Native memory environment
• Easy debugging and diagnostics
• Have complex problems:
• Parallelize existing code without expensive rewrites
• Sophisticated algorithms and systems
• Real-time response to small-data
• Scales up and down:
• Scales to 1000-node clusters
• Also runs cheaply on a laptop
# import pandas as pd
import dask.dataframe as dd
Dask
66
• Started as part of Blaze in early 2014.
• General parallel programming engine
• Flexible and therefore highly suited for
• Commodity Clusters
• Advanced Algorithms
• Wide community adoption and use
conda install -c conda-forge dask
pip install dask[complete] distributed --upgrade
67
[Figure: a Small Data to Big Data spectrum, with Numba.]
Dask: From User Interaction to Execution
68
[Figure: how code (including dask.delayed) becomes a task graph and is executed.]
Dask: Parallel Data Processing
69
• Synthetic views of NumPy ndarrays
• Synthetic views of Pandas DataFrames, with HDFS support
• DAG construction and workflow manager
Overview of Dask
70
Dask is a Python parallel computing library that is:
• Familiar: Implements parallel NumPy and Pandas objects
• Fast: Optimized for demanding numerical applications
• Flexible: for sophisticated and messy algorithms
• Scales up: Runs resiliently on clusters of 100s of machines
• Scales down: Pragmatic in a single process on a laptop
• Interactive: Responsive and fast for interactive data science

Dask complements the rest of Anaconda. It was developed with NumPy, Pandas, and scikit-learn developers.
Dask Collections: Familiar Expressions and API
71
Dask array (mimics NumPy):         x.T - x.mean(axis=0)
Dask dataframe (mimics Pandas):    df.groupby(df.index).value.mean()
Dask bag (collection of data):     b.map(json.loads).foldby(...)
Dask delayed (wraps custom code):  def load(filename): …
                                   def clean(data): …
                                   def analyze(result): …
72
Dask DataFrame is like Pandas

>>> import pandas as pd
>>> df = pd.read_csv('iris.csv')
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
>>> max_sepal_length_setosa = df[df.species == 'setosa'].sepal_length.max()
5.7999999999999998

>>> import dask.dataframe as dd
>>> ddf = dd.read_csv('*.csv')
>>> ddf.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
…
>>> d_max_sepal_length_setosa = ddf[ddf.species == 'setosa'].sepal_length.max()
>>> d_max_sepal_length_setosa.compute()
5.7999999999999998
New Spark/Hadoop clusters
• Create and provision a Spark/Hadoop cluster with a few simple steps
• Work on the cloud or with your existing in-house servers
Dask Graphs: Example Machine Learning Pipeline
73
Example 1: Using Dask DataFrames on a cluster with CSV data
74
• Built from Pandas DataFrames
• Match Pandas interface
• Access data from HDFS, S3, local, etc.
• Fast, low latency
• Responsive user interface
75
Dask Array is like NumPy

>>> import numpy as np
>>> np_ones = np.ones((5000, 1000))
>>> np_ones
array([[ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       ...,
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.]])
>>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
>>> np_y
array([ 693.14718056,  693.14718056,  693.14718056,  693.14718056,  693.14718056])

>>> import dask.array as da
>>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000))
>>> da_ones.compute()
array([[ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       ...,
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.]])
>>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
>>> np_da_y = np.array(da_y)   # fits in memory
array([ 693.14718056,  693.14718056,  693.14718056,  693.14718056, …,  693.14718056])

# If the result doesn't fit in memory
>>> da_y.to_hdf5('myfile.hdf5', 'result')
Example 3: Using Dask Arrays with global temperature data
76
• Built from NumPy n-dimensional arrays
• Matches NumPy interface (subset)
• Solve medium-large problems
• Complex algorithms
Dask Schedulers: Distributed Scheduler
77
Cluster Architecture Diagram
78
[Figure: a Client Machine connects to a Head Node, which coordinates several Compute Nodes.]
Using Anaconda and Dask on your Cluster
79
• Single machine with multiple threads or processes
• On a cluster with SSH (dcluster)
• Resource management: YARN (knit), SGE, Slurm
• On the cloud with Amazon EC2 (dec2)
• On a cluster with Anaconda for cluster management
  • Manage multiple conda environments and packages on bare-metal or cloud-based clusters
Dask Protocol
80
The scheduler, workers, and clients pass messages between each other. Semantically these messages encode commands, status updates, and data, like the following:
• Please compute the function sum on the data x and store in y
• The computation y has been completed
• Be advised that a new worker named alice is available for use
• Here is the data for the keys 'x' and 'y'
Dask Protocol
81
In practice we represent these messages with dictionaries/mappings. The protocol is a combination of:
• msgpack for instructions and headers
• pickle/cloudpickle for arbitrary Python objects
• byte strings (with optional compression) for large data
LZ4 is preferred to Snappy, but either will be tried for messages above 1000B. The protocol is extensible by registering serializers.
http://distributed.readthedocs.io/en/latest/protocol.html
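For a feel of the wire format, a hypothetical message as a Python mapping (the field names are illustrative, not the exact protocol; see the linked docs for the real layout):

    import msgpack
    import cloudpickle

    def square(x):
        return x * x

    # Instructions and headers travel as msgpack; the function travels as pickled bytes
    msg = {
        'op': 'compute',                        # illustrative command name
        'key': 'y',
        'function': cloudpickle.dumps(square),  # opaque bytes to the scheduler
        'args': cloudpickle.dumps((10,)),
    }
    wire = msgpack.packb(msg, use_bin_type=True)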
Dask Protocol - Scheduler
82
However, the Scheduler never uses the language-specific serialization and instead deals only with msgpack. If the client sends a pickled function up to the scheduler, the scheduler will not unpack the function but will instead keep it as bytes. Eventually those bytes will be sent to a worker, which will then unpack the bytes into a proper Python function. Because the Scheduler never unpacks language-specific serialized bytes, it could itself be written in a different language.
• The Scheduler is protected from unpickling unsafe code
• The Scheduler can be run under PyPy for improved performance (only useful for larger clusters)
• We could conceivably implement workers and clients for other languages (like R or Julia) and reuse the Python scheduler. The worker and client code is fairly simple and much easier to reimplement than the scheduler, which is complex.
• The scheduler might some day be rewritten in more heavily optimized C or Go
Dask Scheduler
83
• Scheduling arbitrary graphs is hard.
• Optimal graph scheduling is NP-hard.
• Scalable scheduling requires linear-time solutions.
• Fortunately, Dask does well with a lot of heuristics
• … and a lot of monitoring and data about sizes
• … and how long functions take.
Scheduler Visualization with Bokeh
84
What makes Dask different?
Let's look at some pictures of directed graphs.
Most Parallel Framework Architectures
[Figure: a common layered design --- User API → High Level Representation (Logical Plan) → Low Level Representation (Physical Plan) → Task scheduler for execution.]
SQL Database Architecture
SELECT avg(value)
FROM accounts
INNER JOIN customers ON …
WHERE name == 'Alice'

SQL Database Architecture
[The same query with an Optimize step that reorders the plan --- e.g. filtering accounts on name == 'Alice' before the join.]

Spark Architecture
df.join(df2, …)
  .select(…)
  .filter(…)
[Again, with an Optimize step over the logical plan.]

Large Matrix Architecture
(A' * A) \ A' * b
[Again, with an Optimize step over the expression.]
Dask Architecture
[No high-level plan: user code builds a task graph directly for the task scheduler.]

Dask Architecture
accts = dd.read_parquet(…)
accts = accts[accts.name == 'Alice']
df = dd.merge(accts, customers)
df.value.mean().compute()

Dask Architecture
u, s, v = da.linalg.svd(X)
Y = u.dot(da.diag(s)).dot(v.T)
da.linalg.norm(X - Y)

Dask Architecture
for i in range(256):
    x = dask.delayed(f)(i)
    y = dask.delayed(g)(x)
    z = dask.delayed(add)(x, y)

Dask Architecture
async def func():
    client = await Client()
    futures = client.map(…)
    async for f in as_completed(…):
        result = await f

Dask Architecture
Your own system here
By dropping the high level representation
Costs:
• Lose specialization
• Lose opportunities for high level optimization
Benefits:
• Become generalists
• More flexibility for new domains and algorithms
• Access to smarter algorithms
• Better task scheduling: resource constraints, GPUs, multiple clients, async real-time, etc.
Ten Reasons People Choose Dask
Scalable Pandas DataFrames
• Same API
    import dask.dataframe as dd
    df = dd.read_parquet('s3://bucket/accounts/2017')
    df.groupby(df.name).value.mean().compute()
• Efficient timeseries operations
    df.loc['2017-01-01']              # Uses the Pandas index…
    df.value.rolling(10).std()        # for efficient…
    df.value.resample('10m').mean()   # operations.
• Co-developed with Pandas and by the Pandas developer community
Scalable NumPy Arrays
• Same API
    import dask.array as da
    x = da.from_array(my_hdf5_file)
    y = x.dot(x.T)
• Applications
  • Atmospheric science
  • Satellite imagery
  • Biomedical imagery
  • Optimization algorithms (check out dask-glm)
Parallelize Scikit-Learn/Joblib
• Scikit-Learn parallelizes with Joblib
    estimator = RandomForest(…)
    estimator.fit(train_data, train_labels, n_jobs=8)
• Joblib can use Dask
    from sklearn.externals.joblib import parallel_backend
    with parallel_backend('dask', scheduler='…'):
        estimator.fit(train_data, train_labels)
https://pythonhosted.org/joblib/
http://distributed.readthedocs.io/en/latest/joblib.html
[Figure: Joblib dispatching work to a local thread pool.]
Parallelize Scikit-Learn/Joblib
• Scikit-Learn parallelizes with Joblib
    estimator = RandomForest(…)
    estimator.fit(train_data, train_labels, n_jobs=8)
• Joblib can use Dask
    from sklearn.externals.joblib import parallel_backend
    with parallel_backend('dask', scheduler='…'):
        estimator.fit(train_data, train_labels)
https://pythonhosted.org/joblib/
http://distributed.readthedocs.io/en/latest/joblib.html
[Figure: the same stack, with Joblib dispatching to Dask instead of a thread pool.]
Many Other Libraries in Anaconda
• Scikit-Image uses Dask to break down images and speed up algorithms with overlapping regions
• GeoPandas can use Dask to partition data spatially and accelerate spatial joins
Dask Scales Up
• Thousand node clusters
• Cloud computing
• Super computers
• Gigabyte/s bandwidth
• 200 microsecond task overhead
Dask Scales Down (the median cluster size is one)
• Can run in a single Python thread pool
• Almost no performance penalty (microseconds)
• Lightweight
• Few dependencies
• Easy install
Parallelize Web Backends
• Web servers process thousands of small computations asynchronously for web pages or REST endpoints
• Dask provides dynamic, heterogeneous computation
  • Supports small data
  • 10ms roundtrip times
  • Dynamic scaling for different loads
• Supports asynchronous Python (like GoLang)

async def serve(request):
    future = dask_client.submit(process, request)
    result = await future
    return result
Debugging support
• Clean Python tracebacks when user code breaks
• Connect to remote workers with IPython sessions for advanced debugging
Resource constraints
• Define limited hardware resources for workers
• Specify resource constraints when submitting tasks

$ dask-worker … --resources GPU=2
$ dask-worker … --resources GPU=2
$ dask-worker … --resources special-db=1

future = client.submit(my_function, resources={'GPU': 1})

• Used for GPUs, big-memory machines, special hardware, database connections, I/O machines, etc.
Collaboration
• Many users can share the same cluster simultaneously
• Define public datasets
• Repeated computation and data use is shared among everyone

df = dd.read_parquet(…).persist()
client.publish_dataset(accounts=df)
df = client.get_dataset('accounts')
Beautiful Diagnostic Dashboards
• Fast responsive dashboards
• Provide users performance insight
• Powered by Bokeh
Some Reasons not to Choose Dask
• Dask is not a SQL database: it does Pandas well, but won't optimize complex queries.
• Dask is not MPI: very fast, but it does leave some performance on the table (200us task overhead, a couple of copies in the network stack).
• Dask is not a JVM technology: it's a Python library (although Julia bindings are available).
• Dask is not always necessary: you may not need parallelism.

Dask's limitations
dask.pydata.org
conda install dask
Scalable Machine Learning (ML)
Dask-ml — organized work for general scalable machine learning using dask
First release last week!
Organizes work done over the past year into a single home
A single place to discuss scalable machine learning with Python
https://tomaugspurger.github.io/scalable-ml-01
https://tomaugspurger.github.io/scalable-ml-02
https://github.com/dask/dask-ml/releases/tag/v0.2.3
conda install -c conda-forge dask-ml
https://tomaugspurger.github.io/dask-ml-announce
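A minimal sketch of what using dask-ml looks like (hedged: API names as of these early releases, assuming dask_ml.linear_model is present in the installed version; data is synthetic):

    import dask.array as da
    from dask_ml.linear_model import LogisticRegression

    # 100k rows in 10 chunks; training runs on dask arrays without materializing them
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = (da.random.random(100_000, chunks=(10_000,)) > 0.5).astype(int)

    clf = LogisticRegression()
    clf.fit(X, y)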

More Related Content

What's hot

Tensorflow presentation
Tensorflow presentationTensorflow presentation
Tensorflow presentation
Ahmed rebai
 
Scipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonScipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonWes McKinney
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Databricks
 
What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial For Be...
What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial For Be...What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial For Be...
What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial For Be...
Simplilearn
 
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
Yu-Hsun (lymanblue) Lin
 
TensorFlow Dev Summit 2017 요약
TensorFlow Dev Summit 2017 요약TensorFlow Dev Summit 2017 요약
TensorFlow Dev Summit 2017 요약
Jin Joong Kim
 
Deep Learning with PyTorch
Deep Learning with PyTorchDeep Learning with PyTorch
Deep Learning with PyTorch
Mayur Bhangale
 
Intro to Python
Intro to PythonIntro to Python
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
MLconf
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache Spark
Databricks
 
深層学習フレームワーク概要とChainerの事例紹介
深層学習フレームワーク概要とChainerの事例紹介深層学習フレームワーク概要とChainerの事例紹介
深層学習フレームワーク概要とChainerの事例紹介
Kenta Oono
 
SciPy 2019: How to Accelerate an Existing Codebase with Numba
SciPy 2019: How to Accelerate an Existing Codebase with NumbaSciPy 2019: How to Accelerate an Existing Codebase with Numba
SciPy 2019: How to Accelerate an Existing Codebase with Numba
stan_seibert
 
Icpp power ai-workshop 2018
Icpp power ai-workshop 2018Icpp power ai-workshop 2018
Icpp power ai-workshop 2018
Ganesan Narayanasamy
 
Tensorflow
TensorflowTensorflow
Tensorflow
marwa Ayad Mohamed
 
Language translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlowLanguage translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlow
S N
 
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
Big Data Spain
 
Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | Edureka
Edureka!
 
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Altoros
 
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
MLconf
 
TensorFlow and Keras: An Overview
TensorFlow and Keras: An OverviewTensorFlow and Keras: An Overview
TensorFlow and Keras: An Overview
Poo Kuan Hoong
 

What's hot (20)

Tensorflow presentation
Tensorflow presentationTensorflow presentation
Tensorflow presentation
 
Scipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonScipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in Python
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
 
What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial For Be...
What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial For Be...What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial For Be...
What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial For Be...
 
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
 
TensorFlow Dev Summit 2017 요약
TensorFlow Dev Summit 2017 요약TensorFlow Dev Summit 2017 요약
TensorFlow Dev Summit 2017 요약
 
Deep Learning with PyTorch
Deep Learning with PyTorchDeep Learning with PyTorch
Deep Learning with PyTorch
 
Intro to Python
Intro to PythonIntro to Python
Intro to Python
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache Spark
 
深層学習フレームワーク概要とChainerの事例紹介
深層学習フレームワーク概要とChainerの事例紹介深層学習フレームワーク概要とChainerの事例紹介
深層学習フレームワーク概要とChainerの事例紹介
 
SciPy 2019: How to Accelerate an Existing Codebase with Numba
SciPy 2019: How to Accelerate an Existing Codebase with NumbaSciPy 2019: How to Accelerate an Existing Codebase with Numba
SciPy 2019: How to Accelerate an Existing Codebase with Numba
 
Icpp power ai-workshop 2018
Icpp power ai-workshop 2018Icpp power ai-workshop 2018
Icpp power ai-workshop 2018
 
Tensorflow
TensorflowTensorflow
Tensorflow
 
Language translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlowLanguage translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlow
 
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
 
Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | Edureka
 
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
 
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
 
TensorFlow and Keras: An Overview
TensorFlow and Keras: An OverviewTensorFlow and Keras: An Overview
TensorFlow and Keras: An Overview
 

Similar to Scaling Python to CPUs and GPUs

Numba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyNumba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPy
Travis Oliphant
 
Pysense: wireless sensor computing in Python?
Pysense: wireless sensor computing in Python?Pysense: wireless sensor computing in Python?
Pysense: wireless sensor computing in Python?
Davide Carboni
 
Euro python2011 High Performance Python
Euro python2011 High Performance PythonEuro python2011 High Performance Python
Euro python2011 High Performance Python
Ian Ozsvald
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
Travis Oliphant
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Databricks
 
Ropython-windbg-python-extensions
Ropython-windbg-python-extensionsRopython-windbg-python-extensions
Ropython-windbg-python-extensions
Alin Gabriel Serdean
 
Connecting Python To The Spark Ecosystem
Connecting Python To The Spark EcosystemConnecting Python To The Spark Ecosystem
Connecting Python To The Spark Ecosystem
Spark Summit
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Daniel Rodriguez
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
Travis Oliphant
 
From Zero to Hero - All you need to do serious deep learning stuff in R
From Zero to Hero - All you need to do serious deep learning stuff in R From Zero to Hero - All you need to do serious deep learning stuff in R
From Zero to Hero - All you need to do serious deep learning stuff in R
Kai Lichtenberg
 
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Universitat Politècnica de Catalunya
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
PeterAndreasEntschev
 
Linux and Open Source in Math, Science and Engineering
Linux and Open Source in Math, Science and EngineeringLinux and Open Source in Math, Science and Engineering
Linux and Open Source in Math, Science and Engineering
PDE1D
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017
Yu-Hsun (lymanblue) Lin
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
Jen Aman
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
Jen Aman
 
Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyData
Travis Oliphant
 
Machine learning with py torch
Machine learning with py torchMachine learning with py torch
Machine learning with py torch
Riza Fahmi
 
Swift for tensorflow
Swift for tensorflowSwift for tensorflow
Swift for tensorflow
규영 허
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to pythonActiveState
 

Similar to Scaling Python to CPUs and GPUs (20)

Numba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyNumba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPy
 
Pysense: wireless sensor computing in Python?
Pysense: wireless sensor computing in Python?Pysense: wireless sensor computing in Python?
Pysense: wireless sensor computing in Python?
 
Euro python2011 High Performance Python
Euro python2011 High Performance PythonEuro python2011 High Performance Python
Euro python2011 High Performance Python
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
 
Ropython-windbg-python-extensions
Ropython-windbg-python-extensionsRopython-windbg-python-extensions
Ropython-windbg-python-extensions
 
Connecting Python To The Spark Ecosystem
Connecting Python To The Spark EcosystemConnecting Python To The Spark Ecosystem
Connecting Python To The Spark Ecosystem
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
From Zero to Hero - All you need to do serious deep learning stuff in R
From Zero to Hero - All you need to do serious deep learning stuff in R From Zero to Hero - All you need to do serious deep learning stuff in R
From Zero to Hero - All you need to do serious deep learning stuff in R
 
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
Linux and Open Source in Math, Science and Engineering
Linux and Open Source in Math, Science and EngineeringLinux and Open Source in Math, Science and Engineering
Linux and Open Source in Math, Science and Engineering
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
 
Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyData
 
Machine learning with py torch
Machine learning with py torchMachine learning with py torch
Machine learning with py torch
 
Swift for tensorflow
Swift for tensorflowSwift for tensorflow
Swift for tensorflow
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to python
 

More from Travis Oliphant

PyData Barcelona Keynote
PyData Barcelona KeynotePyData Barcelona Keynote
PyData Barcelona Keynote
Travis Oliphant
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
Travis Oliphant
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
Travis Oliphant
 
Python as the Zen of Data Science
Python as the Zen of Data SciencePython as the Zen of Data Science
Python as the Zen of Data Science
Travis Oliphant
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
Travis Oliphant
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
Travis Oliphant
 
Bids talk 9.18
Bids talk 9.18Bids talk 9.18
Bids talk 9.18
Travis Oliphant
 
Effectively using Open Source with conda
Effectively using Open Source with condaEffectively using Open Source with conda
Effectively using Open Source with conda
Travis Oliphant
 
London level39
London level39London level39
London level39
Travis Oliphant
 
Blaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for PythonBlaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for Python
Travis Oliphant
 
PyData Introduction
PyData IntroductionPyData Introduction
PyData Introduction
Travis Oliphant
 
Numba
NumbaNumba

More from Travis Oliphant (13)

PyData Barcelona Keynote
PyData Barcelona KeynotePyData Barcelona Keynote
PyData Barcelona Keynote
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
 
Python as the Zen of Data Science
Python as the Zen of Data SciencePython as the Zen of Data Science
Python as the Zen of Data Science
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Bids talk 9.18
Bids talk 9.18Bids talk 9.18
Bids talk 9.18
 
Effectively using Open Source with conda
Effectively using Open Source with condaEffectively using Open Source with conda
Effectively using Open Source with conda
 
London level39
London level39London level39
London level39
 
Blaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for PythonBlaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for Python
 
Numba lightning
Numba lightningNumba lightning
Numba lightning
 
PyData Introduction
PyData IntroductionPyData Introduction
PyData Introduction
 
Numba
NumbaNumba
Numba
 

Recently uploaded

Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Globus
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 

Recently uploaded (20)

Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 

Scaling Python to CPUs and GPUs

  • 23. "conda – package everything": conda's sandboxing technology keeps side-by-side isolated environments, e.g. (A) Python v2.7 with NumPy v1.10 and Pandas v0.16; (B) Python v3.4 with NumPy v1.11, Pandas v0.18, and Jupyter; (C) R with R Essentials. Language and platform independent, no special privileges required, no VMs or containers; enables reproducibility, collaboration, and scaling.
  • 24.

        $ anaconda-project run plot --show
        $ conda install tensorflow
  • 25. Basic Conda Usage
        Install a package:            conda install sympy
        List all installed packages:  conda list
        Search for packages:          conda search llvm
        Create a new environment:     conda create -n py3k python=3
        Remove a package:             conda remove nose
        Get help:                     conda install --help
  • 26. Advanced Conda Usage
        Install a package in an environment:          conda install -n py3k sympy
        Update all packages:                          conda update --all
        Export list of packages:                      conda list --export packages.txt
        Install packages from an export:              conda install --file packages.txt
        See package history:                          conda list --revisions
        Revert to a revision:                         conda install --revision 23
        Remove unused packages and cached tarballs:   conda clean -pt
  • 29. Without NumPy. Python is a great language but needed a way to operate quickly and cleanly over multi-dimensional arrays.

        # functions.py
        from math import sin, pi

        def sinc(x):
            if x == 0:
                return 1.0
            else:
                pix = pi*x
                return sin(pix)/pix

        def step(x):
            if x > 0:
                return 1.0
            elif x < 0:
                return 0.0
            else:
                return 0.5

        >>> import functions as f
        >>> xval = [x/3.0 for x in range(-10,10)]
        >>> yval1 = [f.sinc(x) for x in xval]
        >>> yval2 = [f.step(x) for x in xval]
  • 30. With NumPy. NumPy offers an N-d array, element-by-element functions, and basic random numbers, linear algebra, and FFT capability for Python. http://numpy.org (fiscally sponsored by NumFOCUS)

        # functions2.py
        from numpy import sin, pi
        from numpy import vectorize
        import functions as f

        vsinc = vectorize(f.sinc)
        def sinc(x):
            pix = pi*x
            val = sin(pix)/pix
            val[x==0] = 1.0
            return val

        vstep = vectorize(f.step)
        def step(x):
            y = x*0.0
            y[x>0] = 1
            y[x==0] = 0.5
            return y

        >>> import functions2 as f
        >>> from numpy import *
        >>> x = r_[-10:10]/3.0
        >>> y1 = f.sinc(x)
        >>> y2 = f.step(x)
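For comparison, the same two functions can also be written with array operations directly, without vectorize; a minimal sketch (the masking approach is ours, not from the slides):

        import numpy as np

        def sinc(x):
            x = np.asarray(x, dtype=float)
            out = np.ones_like(x)            # sinc(0) == 1 by definition
            nz = x != 0                      # mask out zero to avoid division by zero
            out[nz] = np.sin(np.pi * x[nz]) / (np.pi * x[nz])
            return out

        def step(x):
            x = np.asarray(x, dtype=float)
            # nested where: 1 above zero, 0 below zero, 0.5 exactly at zero
            return np.where(x > 0, 1.0, np.where(x < 0, 0.0, 0.5))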
  • 31. NumPy: an Array Extension of Python
        • Data: the array object
          - slicing and shaping
          - data-type map to bytes
        • Fast math (ufuncs):
          - vectorization
          - broadcasting
          - aggregations
  • 32. NumPy Array: key attributes are dtype, shape, ndim, strides, and data
  • 33. NumPy Examples (figure: constructing 2-d and 3-d arrays and computing aggregations; the code and outputs are shown as an image)
  • 34. NumPy Slicing (Selection)

        >>> a[0,3:5]
        array([3, 4])
        >>> a[4:,4:]
        array([[44, 45],
               [54, 55]])
        >>> a[:,2]
        array([ 2, 12, 22, 32, 42, 52])
        >>> a[2::2,::2]
        array([[20, 22, 24],
               [40, 42, 44]])
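The array a behind these outputs appears only as an image in the deck; a 6x6 array with a[i, j] = 10*i + j reproduces every result above:

        import numpy as np
        a = np.arange(6).reshape(-1, 1) * 10 + np.arange(6)   # a[i, j] == 10*i + j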
  • 35. Summary
        • Provides a foundational N-dimensional array composed of homogeneous elements of a particular "dtype"
        • The set of available dtypes is extensive (but difficult to extend)
        • Arrays can be sliced and diced with simple syntax for easy manipulation and selection
        • Provides fast and powerful math, statistics, and linear algebra functions that operate over arrays
        • Utilities for sorting and for reading and writing data are also provided
  • 36. Scaling Up and Out with Numba and Dask 36
  • 37. Scale Up vs Scale Out
        • Scale up (bigger nodes): big memory & many cores / GPU box → Numba
        • Scale out (more nodes): many commodity nodes in a cluster → Dask
        • Best of both (e.g. GPU cluster) → Dask with Numba
  • 38. Development snapshot (October 2017):

        Name     Latest release   Releases   GitHub stars   Contributors   Downloads (trailing 12 mo.)
        numba    0.35             92         2647           74             1.8m
        dask     0.15.4           38         2001           112            1.2m
        dask-ml  0.3.1            6          104            4

        Numba:   http://numba.pydata.org / http://github.com/numba
        Dask:    http://dask.pydata.org / http://github.com/dask
        Dask-ml: http://dask-ml.readthedocs.io/en/latest/index.html / http://github.com/dask/dask-ml
  • 40. Numba (compile Python to CPUs and GPUs)

        conda install numba

        Compilation pipeline: Python source → Numba (parsing frontend) → LLVM Intermediate Representation (IR) → LLVM code-generation backend → x86 / ARM / PTX
  • 41. Image Processing (~1500x speed-up)

        from numba import jit

        @jit('void(f8[:,:],f8[:,:],f8[:,:])')
        def filter(image, filt, output):
            M, N = image.shape
            m, n = filt.shape
            for i in range(m//2, M-m//2):
                for j in range(n//2, N-n//2):
                    result = 0.0
                    for k in range(m):
                        for l in range(n):
                            result += image[i+k-m//2, j+l-n//2] * filt[k, l]
                    output[i, j] = result
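A hedged driver for the kernel above (the image size and box filter are ours; the first call triggers compilation):

        import numpy as np

        image = np.random.random((1024, 1024))
        filt = np.ones((5, 5)) / 25.0        # simple box-blur filter
        output = np.zeros_like(image)
        filter(image, filt, output)          # compiles on first call, then runs at machine speed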
  • 42. Works with and does not replace the standard Python interpreter (all of your existing Python libraries are still available) Numba Features 42
  • 43. Example: Filter an array 43
  • 44. Example: Filter an array. Annotations on the slide's code: array allocation; looping over ndarray x as an iterator; using numpy math functions; returning a slice of the array; Numba decorator (nopython=True not required). Result: 2.7x speedup over NumPy!
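The code for this slide lives in an image, so it is not in the transcript. A minimal sketch consistent with the annotations above (the function name and selection logic are ours):

        import numpy as np
        from numba import jit

        @jit                                  # nopython=True not required here
        def keep_above(x, threshold):
            out = np.empty_like(x)            # array allocation
            count = 0
            for v in x:                       # looping over ndarray x as an iterator
                if np.abs(v) > threshold:     # using numpy math functions
                    out[count] = v
                    count += 1
            return out[:count]                # returning a slice of the array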
  • 45. NumPy UFuncs and GUFuncs. NumPy ufuncs (and gufuncs) are functions that operate "element-wise" (or "sub-dimension-wise") across an array without an explicit loop. This implicit loop (which is in machine code) is at the core of why NumPy is fast. Dispatch is done internally to a particular code segment based on the type of the array. It is a very powerful abstraction in the PyData stack. Making new fast ufuncs used to be possible only in C (painful!). With numba.vectorize and numba.guvectorize it is now easy! The inner secrets of NumPy are now at your fingertips for you to make your own magic!
  • 46. Simple Ufunc

        from numba import vectorize
        import numpy as np

        @vectorize
        def dot2(a, b, x, y):
            return a*x + b*y

        >>> a, b, x, y = np.random.randn(4, 1000)
        >>> z = a * x + b * y
        >>> z2 = dot2(a, b, x, y)   # faster

        Faster (especially as N grows) because it does not create temporaries: NumPy creates temporary arrays for intermediate results, while Numba creates a fast machine-code kernel from the Python template and calls it for every element in the arrays.
  • 47. Generalized Ufunc (3.8x faster)

        from numba import guvectorize
        import numpy as np

        @guvectorize(['f8[:], f8[:], f8[:]'], '(n),(n)->()')
        def dot2(a, b, c):
            c[0] = a[0]*b[0] + a[1]*b[1]

        >>> a, b = np.random.randn(10000, 2), np.random.randn(10000, 2)
        >>> z1 = np.einsum('ij,ij->i', a, b)
        >>> z2 = dot2(a, b)   # uses the last dimension as n in each kernel call

        This can create quite a bit of computation with very little code: Numba creates a fast machine-code kernel from the Python template and calls it once per row.
  • 48. Example: Making a windowed compute filter. Perform a computation on a finite window of the input. For a linear system this is a FIR filter, which np.convolve or scipy.signal.lfilter can handle. But what if you want some arbitrary computation, like a windowed median filter?
  • 49. Example: Making a windowed compute filter. Hand-coded implementation vs. building a ufunc for the kernel, which is faster for large arrays! Array-oriented! This can now run easily on the GPU with target='cuda' and on many cores with target='parallel'; a sketch follows below.
  • 50. Example: Making a windowed compute filter (benchmark figure)
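The implementations compared on these slides are shown as images. One way to express a windowed median as a gufunc, as a sketch (the signature and edge handling are ours; swapping in target='parallel' or target='cuda' on @guvectorize moves it to many cores or the GPU):

        import numpy as np
        from numba import guvectorize

        @guvectorize(['void(f8[:], i8, f8[:])'], '(n),()->(n)')
        def windowed_median(x, width, out):
            half = width // 2
            for i in range(x.shape[0]):
                lo = max(i - half, 0)                 # clip the window at the array edges
                hi = min(i + half + 1, x.shape[0])
                seg = np.sort(x[lo:hi])               # np.sort returns a sorted copy
                out[i] = seg[seg.shape[0] // 2]

        y = windowed_median(np.random.randn(10000), 7)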
  • 51. How to Use Numba
        1. Create a realistic benchmark test case. (Do not use your unit tests as a benchmark!)
        2. Run a profiler on your benchmark. (cProfile is a good choice; see the sketch after this list.)
        3. Identify hotspots that could potentially be compiled by Numba with a little refactoring. (See the online documentation.)
        4. Apply @numba.jit, @numba.vectorize, and @numba.guvectorize as needed to critical functions. (Small rewrites may be needed to work around Numba limitations.)
        5. Re-run the benchmark to check whether there was a performance improvement.
        6. Use target='parallel' to get access to multiple cores (or target='cuda' if you have a GPU).
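Step 2 in practice, assuming the benchmark is wrapped in a main() function (a minimal sketch):

        import cProfile
        # sort by cumulative time to surface the hotspots for step 3
        cProfile.run('main()', sort='cumtime')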
  • 52. How Numba works

        @jit
        def do_math(a, b):
            …

        >>> do_math(x, y)

        Pipeline: Python function (bytecode) + function arguments → bytecode analysis → type inference → Numba IR → rewrite IR → lowering → LLVM IR → LLVM JIT → machine code (cached) → execute!
  • 53. 7 things about Numba you may not know:
        1. Numba is 100% open source
        2. Numba + Jupyter = rapid CUDA prototyping
        3. Numba can compile for the CPU and the GPU at the same time
        4. Numba makes array processing easy with @vectorize and @guvectorize
        5. Numba comes with a CUDA simulator
        6. You can send Numba functions over the network
        7. Numba developers are working on a GPU DataFrame (pygdf)
  • 54. Numba is quite popular! The numba mailing list carries a report from a SciPy author who got a 2x speed-up by removing their Cython type annotations and surrounding the function with numba.jit (with a few minor changes needed to the code). With Numba's ahead-of-time compilation, one can use Numba to create a library to ship to others (who then don't need to have Numba installed), though this is not as clean as it could be. SciPy (and NumPy) would look very different if Numba had existed 16 years ago when SciPy was getting started, and the PyPy crowd would be happier.
  • 55. Releasing the GIL 55 Only nopython mode functions can release the GIL
  • 56. Releasing the GIL 56 2.8x speedup with 4 cores
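The code for these two slides is shown as images; a minimal sketch of the nogil pattern (the array size and four-way split are ours, chosen to mirror the 4-core speedup):

        import numpy as np
        from numba import jit
        from concurrent.futures import ThreadPoolExecutor

        @jit(nopython=True, nogil=True)       # nogil releases the GIL; requires nopython mode
        def total(x):
            s = 0.0
            for i in range(x.shape[0]):
                s += x[i]
            return s

        x = np.random.randn(40000000)
        chunks = np.array_split(x, 4)         # one chunk per core
        with ThreadPoolExecutor(4) as pool:   # threads overlap once the GIL is released
            result = sum(pool.map(total, chunks))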
  • 57. CUDA Python (in open-source Numba!): CUDA development using Python syntax for optimal performance! 10-20x faster than CPU. You do have to understand CUDA at least a little, since you are writing kernels that launch in parallel on the GPU.
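A minimal CUDA Python kernel, as a sketch (the kernel itself is ours; the kernel[blocks, threads](...) launch syntax is Numba's):

        import numpy as np
        from numba import cuda

        @cuda.jit
        def scale(out, x, k):
            i = cuda.grid(1)          # absolute position of this thread in the grid
            if i < x.size:            # guard: the grid may be larger than the array
                out[i] = k * x[i]

        x = np.arange(1000000, dtype=np.float64)
        out = np.empty_like(x)
        threads_per_block = 256
        blocks = (x.size + threads_per_block - 1) // threads_per_block
        scale[blocks, threads_per_block](out, x, 2.0)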
  • 58. Classic Example: Mandelbrot

        from numba import jit

        @jit
        def mandel(x, y, max_iters):
            c = complex(x, y)
            z = 0j
            for i in range(max_iters):
                z = z*z + c
                if z.real * z.real + z.imag * z.imag >= 4:
                    return 255 * i // max_iters
            return 255
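The classic example normally pairs mandel with a host-side driver that fills an image; the sketch below follows the well-known Numba Mandelbrot example (the bounds and image size are ours):

        import numpy as np

        def create_fractal(min_x, max_x, min_y, max_y, image, iters):
            height, width = image.shape
            pixel_size_x = (max_x - min_x) / width
            pixel_size_y = (max_y - min_y) / height
            for y in range(height):
                imag = min_y + y * pixel_size_y
                for x in range(width):
                    real = min_x + x * pixel_size_x
                    image[y, x] = mandel(real, imag, iters)

        image = np.zeros((1024, 1536), dtype=np.uint8)
        create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20)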
  • 59. The Basics (Mandelbrot benchmark)

        CPython                        1x
        NumPy array-wide operations    13x
        Numba (CPU)                    120x
        Numba (NVidia Tesla K20c)      2100x
  • 60. Other topics
        • CUDA Python: write general GPU kernels with Python
        • Device Arrays: manage memory transfer from host to GPU
        • Streaming: manage asynchronous and parallel GPU compute streams
        • CUDA Simulator in Python: to help debug your kernels
        • HSA Support: early support for HSA-based GPUs and APUs
        • Pyculib: access to cuFFT, cuBLAS, cuSPARSE, cuRAND, CUDA sorting
        https://github.com/ContinuumIO/gtc2017-numba
  • 62. • Designed to parallelize the Python ecosystem
          - Handles complex algorithms
          - Co-developed with the Pandas/scikit-learn/Jupyter teams
          - Familiar APIs for Python users
        • Scales
          - From multicore machines to 1000-node clusters
          - Resilient, responsive, and real-time
  • 63. • Parallelizes NumPy, Pandas, scikit-learn
          - Satisfies a subset of these APIs
          - Uses these libraries internally
          - Co-developed with these teams
        • Task scheduler supports custom algorithms
          - Parallelize existing code
          - Build novel real-time systems
          - Arbitrary task graphs with data dependencies
          - Same scalability
  • 64. Demo video
        • High level: scaling Pandas
          - Same Pandas look and feel
          - Uses Pandas under the hood
          - Scales nicely onto many machines
        • Low level: arbitrary task scheduling
          - Parallelize normal Python code
          - Build custom algorithms
          - React in real time
        • Demo deployed with dask-kubernetes on Google Compute Engine (github.com/dask/dask-kubernetes)
        • YouTube link: https://www.youtube.com/watch?v=ods97a5Pzw0
  • 65. Why do people choose Dask?
        • Familiar with Python:
          - Drop-in NumPy/Pandas/scikit-learn APIs
          - Native memory environment
          - Easy debugging and diagnostics
        • Have complex problems:
          - Parallelize existing code without expensive rewrites
          - Sophisticated algorithms and systems
          - Real-time response to small data
        • Scales up and down:
          - Scales to 1000-node clusters
          - Also runs cheaply on a laptop

        # import pandas as pd
        import dask.dataframe as dd
  • 66. Dask
        • Started as part of Blaze in early 2014
        • General parallel programming engine
        • Flexible, and therefore highly suited for commodity clusters and advanced algorithms
        • Wide community adoption and use

        conda install -c conda-forge dask
        pip install dask[complete] distributed --upgrade
  • 68. Dask: From User Interaction to Execution (figure built around dask.delayed)
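The diagram itself is an image; the idea in code, as a small sketch using dask.delayed (the function names are ours):

        import dask

        @dask.delayed
        def inc(x):
            return x + 1

        # building the expression only constructs a task graph...
        total = dask.delayed(sum)([inc(i) for i in range(5)])
        # ...which the scheduler executes on .compute()
        print(total.compute())   # 15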
  • 69. Dask: Parallel Data Processing
        • Synthetic views of NumPy ndarrays
        • Synthetic views of Pandas DataFrames, with HDFS support
        • DAG construction and workflow manager
  • 70. Overview of Dask. Dask is a Python parallel computing library that is:
        • Familiar: implements parallel NumPy and Pandas objects
        • Fast: optimized for demanding numerical applications
        • Flexible: for sophisticated and messy algorithms
        • Scales up: runs resiliently on clusters of 100s of machines
        • Scales down: pragmatic in a single process on a laptop
        • Interactive: responsive and fast for interactive data science
        Dask complements the rest of Anaconda. It was developed with the NumPy, Pandas, and scikit-learn developers.
  • 71. Dask Collections: Familiar Expressions and API
        Dask array (mimics NumPy):        x.T - x.mean(axis=0)
        Dask dataframe (mimics Pandas):   df.groupby(df.index).value.mean()
        Dask bag (collection of data):    b.map(json.loads).foldby(...)
        Dask delayed (wraps custom code): def load(filename): ... / def clean(data): ... / def analyze(result): ...
  • 72. Dask DataFrame is like Pandas

        >>> import pandas as pd
        >>> df = pd.read_csv('iris.csv')
        >>> df.head()
           sepal_length  sepal_width  petal_length  petal_width      species
        0           5.1          3.5           1.4          0.2  Iris-setosa
        1           4.9          3.0           1.4          0.2  Iris-setosa
        2           4.7          3.2           1.3          0.2  Iris-setosa
        3           4.6          3.1           1.5          0.2  Iris-setosa
        4           5.0          3.6           1.4          0.2  Iris-setosa
        >>> max_sepal_length_setosa = df[df.species == 'setosa'].sepal_length.max()
        5.7999999999999998

        >>> import dask.dataframe as dd
        >>> ddf = dd.read_csv('*.csv')
        >>> ddf.head()
        (same output as above)
        >>> d_max_sepal_length_setosa = ddf[ddf.species == 'setosa'].sepal_length.max()
        >>> d_max_sepal_length_setosa.compute()
        5.7999999999999998
  • 73. Dask Graphs: Example Machine Learning Pipeline (figure)
  • 74. Example 1: Using Dask DataFrames on a cluster with CSV data
        • Built from Pandas DataFrames
        • Match the Pandas interface
        • Access data from HDFS, S3, local disk, etc.
        • Fast, low latency
        • Responsive user interface
  • 75. Dask Array is like NumPy

        >>> import numpy as np
        >>> np_ones = np.ones((5000, 1000))
        >>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
        >>> np_y
        array([ 693.14718056,  693.14718056,  693.14718056,  693.14718056,
                693.14718056])

        >>> import dask.array as da
        >>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000))
        >>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
        >>> np_da_y = np.array(da_y)                 # if the result fits in memory
        >>> da_y.to_hdf5('myfile.hdf5', 'result')    # if the result doesn't fit in memory
  • 76. Example 3: Using Dask Arrays with global temperature data
        • Built from NumPy n-dimensional arrays
        • Matches the NumPy interface (a subset)
        • Solves medium-to-large problems
        • Handles complex algorithms
  • 78. Cluster Architecture Diagram (figure): a client machine talks to a head node, which schedules work across the compute nodes
  • 79. Using Anaconda and Dask on your Cluster
        • Single machine with multiple threads or processes
        • On a cluster with SSH (dcluster)
        • Resource management: YARN (knit), SGE, Slurm
        • On the cloud with Amazon EC2 (dec2)
        • On a cluster with Anaconda for cluster management: manage multiple conda environments and packages on bare-metal or cloud-based clusters
  • 80. Dask Protocol. The scheduler, workers, and clients pass messages between each other. Semantically these messages encode commands, status updates, and data, like the following:
        • "Please compute the function sum on the data x and store in y"
        • "The computation y has been completed"
        • "Be advised that a new worker named alice is available for use"
        • "Here is the data for the keys 'x' and 'y'"
        In practice these messages are represented with dictionaries/mappings, as sketched below.
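For illustration only (these exact keys are hypothetical, not the actual wire format, which is documented at the link on the next slide), such a message might look like:

        # hypothetical shape of a scheduler message, for illustration
        msg = {'op': 'compute',
               'function': 'sum',   # what to run
               'args': ('x',),      # which keys it consumes
               'key': 'y'}          # where to store the result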
  • 81. Dask Protocol
        • A combination of msgpack for instructions and headers
        • pickle/cloudpickle for arbitrary Python objects
        • Byte strings (with optional compression) for large data; LZ4 is preferred to Snappy, but either will be tried for messages above 1000 B
        • The protocol is extensible by registering serializers
        http://distributed.readthedocs.io/en/latest/protocol.html
  • 82. Dask Protocol - Scheduler. The scheduler never uses the language-specific serialization and instead deals only with msgpack. If the client sends a pickled function up to the scheduler, the scheduler will not unpack the function but will keep it as bytes. Eventually those bytes are sent to a worker, which unpacks them into a proper Python function. Because the scheduler never unpacks language-specific serialized bytes, it could even be implemented in a different language. As a result:
        • The scheduler is protected from unpickling unsafe code
        • The scheduler can run under PyPy for improved performance (only useful for larger clusters)
        • We could conceivably implement workers and clients for other languages (like R or Julia) and reuse the Python scheduler; the worker and client code is fairly simple and much easier to reimplement than the scheduler, which is complex
        • The scheduler might someday be rewritten in more heavily optimized C or Go
  • 83. Dask Scheduler
        • Scheduling arbitrary graphs is hard: optimal graph scheduling is NP-hard, and scalable scheduling requires linear-time solutions
        • Fortunately, Dask does well with a lot of heuristics… and a lot of monitoring and data about sizes and how long functions take
  • 85. What makes Dask different? Let's look at some pictures of directed graphs
  • 88. Most Parallel Framework Architectures: User API → high-level representation (logical plan) → low-level representation (physical plan) → task scheduler for execution
  • 89. SQL Database Architecture

        SELECT avg(value)
        FROM accounts
        INNER JOIN customers ON …
        WHERE name == 'Alice'

  • 90. SQL Database Architecture (optimized: the WHERE filter on name is applied before the INNER JOIN)
  • 92. Large Matrix Architecture: expressions such as (A' * A) and A' * b are optimized before execution
  • 94. Dask Architecture

        accts = dd.read_parquet(…)
        accts = accts[accts.name == 'Alice']
        df = dd.merge(accts, customers)
        df.value.mean().compute()

  • 95. Dask Architecture

        u, s, v = da.linalg.svd(X)
        Y = u.dot(da.diag(s)).dot(v.T)
        da.linalg.norm(X - Y)

  • 96. Dask Architecture

        for i in range(256):
            x = dask.delayed(f)(i)
            y = dask.delayed(g)(x)
            z = dask.delayed(add)(x, y)

  • 97. Dask Architecture

        async def func():
            client = await Client()
            futures = client.map(…)
            async for f in as_completed(…):
                result = await f
  • 99. By dropping the high-level representation:
        Costs:
        • Lose specialization
        • Lose opportunities for high-level optimization
        Benefits:
        • Become generalists
        • More flexibility for new domains and algorithms
        • Access to smarter algorithms
        • Better task scheduling: resource constraints, GPUs, multiple clients, async real-time, etc.
  • 101. Scalable Pandas DataFrames
        • Same API:

          import dask.dataframe as dd
          df = dd.read_parquet('s3://bucket/accounts/2017')
          df.groupby(df.name).value.mean().compute()

        • Efficient time-series operations:

          df.loc['2017-01-01']             # uses the Pandas index...
          df.value.rolling(10).std()       # ...for efficient...
          df.value.resample('10m').mean()  # ...operations

        • Co-developed with Pandas and by the Pandas developer community
  • 102. Scalable NumPy Arrays
        • Same API:

          import dask.array as da
          x = da.from_array(my_hdf5_file)
          y = x.dot(x.T)

        • Applications: atmospheric science, satellite imagery, biomedical imagery, optimization algorithms (check out dask-glm)
  • 103. Parallelize Scikit-Learn/Joblib
        • Scikit-Learn parallelizes with Joblib (by default over a thread pool):

          estimator = RandomForest(…)
          estimator.fit(train_data, train_labels, n_jobs=8)

        • Joblib can use Dask as its backend instead:

          from sklearn.externals.joblib import parallel_backend
          with parallel_backend('dask', scheduler='…'):
              estimator.fit(train_data, train_labels)

        https://pythonhosted.org/joblib/
        http://distributed.readthedocs.io/en/latest/joblib.html
  • 105. Many Other Libraries in Anaconda
        • scikit-image uses Dask to break down images and speed up algorithms with overlapping regions
        • GeoPandas can use Dask to partition data spatially and accelerate spatial joins
  • 106. Dask Scales Up:
        • Thousand-node clusters
        • Cloud computing
        • Supercomputers
        • Gigabyte/s bandwidth
        • 200 microsecond task overhead
        Dask Scales Down (the median cluster size is one):
        • Can run in a single Python thread pool
        • Almost no performance penalty (microseconds)
        • Lightweight
        • Few dependencies
        • Easy install
  • 107. Parallelize Web Backends
        • Web servers process thousands of small computations asynchronously for web pages or REST endpoints
        • Dask provides dynamic, heterogeneous computation: supports small data, 10 ms round-trip times, and dynamic scaling for different loads
        • Supports asynchronous Python (like GoLang):

          async def serve(request):
              future = dask_client.submit(process, request)
              result = await future
              return result
  • 108. Debugging support • Clean Python tracebacks when user code breaks • Connect to remote workers with IPython sessions for advanced debugging
  • 109. Resource constraints
        • Define limited hardware resources for workers:

          $ dask-worker … --resources GPU=2
          $ dask-worker … --resources GPU=2
          $ dask-worker … --resources special-db=1

        • Specify resource constraints when submitting tasks:

          future = client.submit(my_function, resources={'GPU': 1})

        • Used for GPUs, big-memory machines, special hardware, database connections, I/O machines, etc.
  • 110. Collaboration
        • Many users can share the same cluster simultaneously
        • Define public datasets
        • Repeated computation and data use is shared among everyone:

          df = dd.read_parquet(…).persist()
          client.publish_dataset(accounts=df)
          df = client.get_dataset('accounts')
  • 111. Beautiful Diagnostic Dashboards • Fast responsive dashboards • Provide users performance insight • Powered by Bokeh
  • 112. Some Reasons not to Choose Dask
  • 113. Dask's limitations
        • Dask is not a SQL database: it does Pandas well, but won't optimize complex queries
        • Dask is not MPI: very fast, but it does leave some performance on the table (200 µs task overhead, a couple of copies in the network stack)
        • Dask is not a JVM technology: it's a Python library (although Julia bindings are available)
        • Dask is not always necessary: you may not need parallelism
  • 115. Scalable Machine Learning (ML)
        dask-ml organizes work for general scalable machine learning using Dask. First release last week! It gathers work done over the past year into a single home and gives the community a single place to discuss scalable machine learning with Python.

          conda install -c conda-forge dask-ml

        https://tomaugspurger.github.io/scalable-ml-01
        https://tomaugspurger.github.io/scalable-ml-02
        https://github.com/dask/dask-ml/releases/tag/v0.2.3
        https://tomaugspurger.github.io/dask-ml-announce

Editor's Notes

  1. How do you recreate your analysis and models? With Anaconda you have the ability to encapsulate data science into packages, notebooks, and entire environments.
  2. Essential for technical conversations and SE team.
  3. Does not replace the standard Python interpreter! The end user does not need a C or C++ compiler (a compiler is required only to build Numba packages). < 70 MB package. Uses LLVM to do final optimization and machine code generation. Supports a wide range of Python language constructs and numeric data types.