© 2017 Anaconda, Inc. - Confidential & Proprietary
Dask: Scaling Python
Matthew Rocklin @mrocklin
Python is large
and growing
https://stackoverflow.blog/2017/09/06/incredible-growth-python/
https://stackoverflow.blog/2017/09/14/python-growing-quickly/
Python’s Scientific Stack
[Figure: Python’s scientific ecosystem, including Bokeh (and many, many more)]
Numeric Python’s virtues and vices
• Fast: Native code with C/C++/CUDA
• Intuitive: Long history with scientists and analysts
• Established: Trusted and well understood
• Broad: Packages for everything, community supported
• But wasn’t designed to scale:
• Limited to a single thread
• Limited to in-memory data
How do we scale an
ecosystem?
From a parallel computing perspective
• Designed to parallelize the Python ecosystem
• Flexible parallel computing paradigm
• Familiar APIs for Python users
• Co-developed with Pandas/SKLearn/Jupyter teams
• Scales
• Scales from multicore to 1000-node clusters
• Resilient, responsive, and real-time
• High Level: Parallel NumPy, Pandas, ML
• Satisfies subset of these APIs
• Uses these libraries internally
• Co-developed with these teams
• Low Level: Task scheduling for arbitrary execution
• Parallelize existing code
• Build novel real-time systems
• Arbitrary task graphs
with data dependencies
• Same scalability
demo
• High level: Scaling Pandas
• Same Pandas look and feel
• Uses Pandas under the hood
• Scales nicely onto many machines
• Low level: Arbitrary task scheduling
• Parallelize normal Python code
• Build custom algorithms
• React in real time
• Demo deployed with
• dask-kubernetes
Google Compute Engine
• github.com/dask/dask-kubernetes
• YouTube link
• https://www.youtube.com/watch?v=ods97a5Pzw0
What makes Dask different?
Most Parallel Frameworks
share the following architecture
1. High level user-facing API
like the SQL language, or Linear Algebra
2. Medium level query plan
For databases/Spark: Big data map-steps, shuffle-steps, and aggregation-steps
For arrays: Matrix multiplies, transposes, slicing
3. Low-level task graph
Read 100MB chunk of data, run black-box function on it
4. Execution system
Run task 9352 on worker 32, move data x-123 to worker 26
Flow from higher to lower level abstractions
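The low-level task graph and the execution system (levels 3 and 4 above) can be sketched in plain Python. This is only an illustration, assuming a dictionary-based graph in which each task is a tuple of a function followed by its arguments. The `execute` helper and the key names are hypothetical, and a real scheduler would also decide where and when each task runs.

```python
from operator import add


def inc(x):
    return x + 1


# Level 3: a low-level task graph.
# Keys name results; a tuple means (function, *arguments).
graph = {
    "x": 1,
    "y": (inc, "x"),
    "z": (add, "y", 10),
}


def execute(graph, key, cache=None):
    """Level 4: a toy, single-threaded execution system (hypothetical helper)."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # String arguments that name other keys are dependencies: run them first.
        resolved = [
            execute(graph, arg, cache) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        result = func(*resolved)
    else:
        # Plain data: nothing to run.
        result = task
    cache[key] = result
    return result


result = execute(graph, "z")
print(result)
```

The executor treats every function as a black box: it only needs to know which keys a task depends on.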
Most Parallel Framework Architectures
User API
High Level Representation
Logical Plan
Low Level Representation
Physical Plan
Task scheduler
for execution
SQL Database Architecture
SELECT avg(value)
FROM accounts
INNER JOIN customers ON …
WHERE name = 'Alice'
SQL Database Architecture
SELECT avg(value)
FROM accounts
WHERE name = 'Alice'
INNER JOIN customers ON …
Optimize
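To see why a query planner might move the filter ahead of the join, here is a minimal sketch in plain Python. The schema, the rows, and the `inner_join` helper are all invented for illustration; the point is only that filtering first gives the same answer while joining fewer rows.

```python
# Hypothetical tables, stored as lists of dicts.
accounts = [
    {"name": "Alice", "value": 100},
    {"name": "Bob", "value": 250},
    {"name": "Alice", "value": 300},
]
customers = [
    {"name": "Alice", "city": "Austin"},
    {"name": "Bob", "city": "Boston"},
]


def inner_join(left, right, key):
    """Hash join: index the right table, then probe it with each left row."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Plan as written: join every row, then filter.
joined = inner_join(accounts, customers, "name")
naive = [row for row in joined if row["name"] == "Alice"]

# Optimized plan: filter first, then join the smaller table.
filtered = [row for row in accounts if row["name"] == "Alice"]
pushed = inner_join(filtered, customers, "name")

average = sum(row["value"] for row in pushed) / len(pushed)
print(len(joined), len(filtered), average)
```

A system can only make this rewrite safely if it understands what "filter" and "join" mean, which is what the high-level representation provides.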
Spark Architecture
df.join(df2, …)
.select(…)
.filter(…)
Optimize
Large Matrix Architecture
(A' * A) \ A' * b
Optimize
Dask Architecture
accts = dd.read_parquet(…)
accts = accts[accts.name == 'Alice']
df = dd.merge(accts, customers)
df.value.mean().compute()
Dask doesn’t have a high-level abstraction
Dask can’t optimize
But Dask is general to many domains
Dask Architecture
u, s, v = da.linalg.svd(X)
Y = u.dot(da.diag(s)).dot(v)  # v is returned already transposed, as in NumPy
da.linalg.norm(X - Y)
Dask Architecture
for i in range(256):
    x = dask.delayed(f)(i)
    y = dask.delayed(g)(x)
    z = dask.delayed(add)(x, y)
Dask Architecture
async def func():
    client = await Client(asynchronous=True)
    futures = client.map(…)
    async for f in as_completed(…):
        result = await f
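The submit-then-collect pattern on this slide can be tried without a cluster. The sketch below uses the standard library's `concurrent.futures` as a stand-in, not Dask itself, and `square` is a placeholder task invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(x):
    # Placeholder for real work.
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    # Submit all tasks up front; each call returns a future immediately.
    futures = [pool.submit(square, i) for i in range(8)]
    # Handle results in completion order rather than submission order.
    results = sorted(future.result() for future in as_completed(futures))

print(results)
```

Completion order is not deterministic, so the results are sorted before use here.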
Dask Architecture
Your own
system here
High-level representations are
powerful
But they also box you in
[Figure: Spark (map stage, shuffle stage, reduce stage) compared with Dask]
By dropping the high level representation
Costs
• Lose specialization
• Lose opportunities for high level optimization
Benefits
• Become generalists
• More flexibility for new domains and algorithms
• Access to smarter algorithms
• Better task scheduling
Resource constraints, GPUs, multiple clients,
asynchronous real-time workloads, etc.
Ten Reasons People
Choose Dask
1. Scalable Pandas DataFrames
• Same API
import dask.dataframe as dd
df = dd.read_parquet('s3://bucket/accounts/2017')
df.groupby(df.name).value.mean().compute()
• Efficient Timeseries Operations
# Use the pandas index for efficient operations
df.loc['2017-01-01']
df.value.rolling(10).std()
df.value.resample('10m').mean()
• Co-developed with Pandas
and by the Pandas developer community
2. Scalable NumPy Arrays
• Same API
import dask.array as da
x = da.from_array(my_hdf5_file)
y = x.dot(x.T)
• Applications
• Atmospheric science
• Satellite imagery
• Biomedical imagery
• Optimization algorithms
check out dask-glm
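Dask arrays work by splitting one large array into many smaller chunks and combining per-chunk results. The idea can be sketched with plain Python lists; the `chunks` helper and the sample data are invented for illustration, and a real blocked array would hold NumPy arrays rather than lists.

```python
def chunks(seq, size):
    """Yield consecutive slices of at most `size` items (hypothetical helper)."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


data = list(range(1, 11))

# Each chunk is reduced independently, so chunks could be processed
# in parallel or loaded one at a time from disk.
partials = [(sum(chunk), len(chunk)) for chunk in chunks(data, 4)]

# Combine the small per-chunk summaries into the final answer.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
blocked_mean = total / count
print(partials, blocked_mean)
```

Only the small `(sum, count)` pairs need to be held together at the end, which is why the full dataset never has to fit in memory at once.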
3. Parallelize Scikit-Learn/Joblib
• Scikit-Learn parallelizes with Joblib
estimator = RandomForest(…, n_jobs=8)
estimator.fit(train_data, train_labels)
• Joblib can use Dask
from joblib import parallel_backend
with parallel_backend('dask', scheduler='…'):
    estimator.fit(train_data, train_labels)
https://pythonhosted.org/joblib/
http://distributed.readthedocs.io/en/latest/joblib.html
Joblib
Thread pool
Dask
4. Parallelize Existing Codebases
• Parallelize custom code with minimal intrusion
results = []
for x in X:
    for y in Y:
        if x < y:
            result = f(x, y)
        else:
            result = g(x, y)
        results.append(result)
• Good for algorithm researchers
• Good for enterprises with entrenched business logic
M Tepper, G Sapiro “Compressed nonnegative
matrix factorization is fast and accurate”,
IEEE Transactions on Signal Processing, 2016
4. Parallelize Existing Codebases
• Parallelize custom code with minimal intrusion
f = dask.delayed(f)
g = dask.delayed(g)
results = []
for x in X:
    for y in Y:
        if x < y:
            result = f(x, y)
        else:
            result = g(x, y)
        results.append(result)
result = dask.compute(results)
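A self-contained version of this pattern might look like the sketch below. The bodies of `f` and `g` and the inputs `X` and `Y` are placeholders invented for illustration; only `dask.delayed` and `dask.compute` come from the slide. The sequential loop is kept alongside so the two results can be compared.

```python
import dask


def f(x, y):
    # Placeholder for real work.
    return x + y


def g(x, y):
    # Placeholder for real work.
    return x - y


X = [1, 2, 3]
Y = [2, 3]

# Sequential reference version.
expected = [f(x, y) if x < y else g(x, y) for x in X for y in Y]

# Delayed version: calling a wrapped function records a task instead of running it.
lazy_f = dask.delayed(f)
lazy_g = dask.delayed(g)

results = []
for x in X:
    for y in Y:
        if x < y:
            results.append(lazy_f(x, y))
        else:
            results.append(lazy_g(x, y))

# dask.compute returns a tuple with one entry per argument passed in.
(values,) = dask.compute(results)
print(values)
```

The loop and the branching logic are untouched; only the two function wrappers and the final `compute` call are new.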
5. Many Other Libraries in Anaconda
• Scikit-Image uses Dask to break down images and
accelerate algorithms with overlapping regions
• Geopandas can scale with Dask
• Spatial partitioning
• Accelerate spatial joins
• (new work)
6. Dask Scales Up
• Thousand node clusters
• Cloud computing
• Supercomputers
• Gigabyte/s bandwidth
• 200 microsecond task overhead
Dask Scales Down (the median cluster size is one)
• Can run in a single Python thread pool
• Almost no performance penalty (microseconds)
• Lightweight
• Few dependencies
• Easy install
7. Parallelize Web Backends
• Web servers process thousands of small computations asynchronously
for web pages or REST endpoints
• Dask provides dynamic, heterogeneous computation
• Supports small data
• 10ms roundtrip times
• Dynamic scaling for different loads
• Supports asynchronous Python (like GoLang)
async def serve(request):
    future = dask_client.submit(process, request)
    result = await future
    return result
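The shape of this handler can be exercised with the standard library alone. In the sketch below a thread pool stands in for the Dask client, and `process` and the sample requests are invented for illustration, so it shows the await-a-submitted-task pattern rather than Dask's own API.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def process(request):
    # Placeholder for the per-request computation.
    return request.upper()


async def serve(request, pool):
    loop = asyncio.get_running_loop()
    # Hand the work to the pool and yield control until it finishes,
    # so the event loop can keep accepting other requests meanwhile.
    result = await loop.run_in_executor(pool, process, request)
    return result


async def main():
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Serve several requests concurrently; gather preserves input order.
        return await asyncio.gather(*(serve(r, pool) for r in ["a", "b", "c"]))


responses = asyncio.run(main())
print(responses)
```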
8. Debugging support
• Clean Python tracebacks when user code breaks
• Connect to remote workers with IPython sessions
for advanced debugging
9. Resource constraints
• Define limited hardware resources for workers
• Specify resource constraints when submitting tasks
$ dask-worker … --resources GPU=2
$ dask-worker … --resources GPU=2
$ dask-worker … --resources special-db=1
dask.compute(…, resources={x: {'GPU': 1},
                           read: {'special-db': 1}})
• Used for GPUs, big-memory machines, special
hardware, database connections, I/O machines, etc.
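The matching rule behind resource constraints can be sketched as a toy in plain Python. This is not Dask's scheduler: the worker names and the `eligible_workers` helper are hypothetical, and the sketch only checks which workers advertise enough of each resource a task asks for.

```python
# Resources each worker advertises (mirrors the three dask-worker commands).
workers = {
    "worker-1": {"GPU": 2},
    "worker-2": {"GPU": 2},
    "worker-3": {"special-db": 1},
}

# Resources each task requires.
tasks = {
    "x": {"GPU": 1},
    "read": {"special-db": 1},
}


def eligible_workers(required, workers):
    """Return the workers that offer at least the required amount of every resource."""
    return sorted(
        name
        for name, available in workers.items()
        if all(available.get(res, 0) >= amount for res, amount in required.items())
    )


placement = {task: eligible_workers(req, workers) for task, req in tasks.items()}
print(placement)
```

A real scheduler would also track how much of each resource is currently in use, so that a worker with `GPU=2` runs at most two such tasks at a time.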
10. Beautiful Diagnostic Dashboards
• Fast responsive dashboards
• Provide users performance insight
• Powered by Bokeh
Some Reasons not to
Choose Dask
• Dask is not a SQL database.
Does Pandas well, but won’t optimize complex queries
• Dask is not a JVM technology
It’s a Python library
(although Julia bindings are available)
• Dask is not a monolithic framework
You’ll have to install Pandas, SKLearn and others as well
Dask is small, designed to complement existing systems
• Parallelism is not always necessary
Use simple solutions if feasible
Dask’s limitations
Why do people choose Dask?
• Familiar with Python:
• Drop-in NumPy/Pandas/SKLearn APIs
• Native memory environment
• Easy debugging and diagnostics
• Have complex problems:
• Parallelize existing code without expensive rewrites
• Sophisticated algorithms and systems
• Real-time response to small data
• Scales up and down:
• Scales to 1000-node clusters
• Also runs cheaply on a laptop
#import pandas as pd
import dask.dataframe as dd
Thank you for your time
Questions?
dask.pydata.org
conda install dask

More Related Content

What's hot

Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...
Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...
Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...
confluent
 

What's hot (20)

Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...
Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...
Performance Analysis and Optimizations for Kafka Streams Applications (Guozha...
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Running Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsRunning Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration Options
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache Flink
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Advanced Flink Training - Design patterns for streaming applications
Advanced Flink Training - Design patterns for streaming applicationsAdvanced Flink Training - Design patterns for streaming applications
Advanced Flink Training - Design patterns for streaming applications
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 

Similar to Dask: Scaling Python

Spark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-fullSpark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-full
Jim Dowling
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark Summit
 

Similar to Dask: Scaling Python (20)

Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
Spark to DocumentDB connector
Spark to DocumentDB connectorSpark to DocumentDB connector
Spark to DocumentDB connector
 
20171104 hk-py con-mysql-documentstore_v1
20171104 hk-py con-mysql-documentstore_v120171104 hk-py con-mysql-documentstore_v1
20171104 hk-py con-mysql-documentstore_v1
 
data science toolkit 101: set up Python, Spark, & Jupyter
data science toolkit 101: set up Python, Spark, & Jupyterdata science toolkit 101: set up Python, Spark, & Jupyter
data science toolkit 101: set up Python, Spark, & Jupyter
 
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur KhanRunning Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
 
Spark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-fullSpark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-full
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
 
The Rise of DataOps: Making Big Data Bite Size with DataOps
The Rise of DataOps: Making Big Data Bite Size with DataOpsThe Rise of DataOps: Making Big Data Bite Size with DataOps
The Rise of DataOps: Making Big Data Bite Size with DataOps
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big Data
 
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Just one-shade-of-openstack
Just one-shade-of-openstackJust one-shade-of-openstack
Just one-shade-of-openstack
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or ...
JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or ...JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or ...
JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or ...
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
deep learning in production cff 2017
deep learning in production cff 2017deep learning in production cff 2017
deep learning in production cff 2017
 

Recently uploaded

Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
ptikerjasaptiker
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
wsppdmt
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Abortion pills in Riyadh +966572737505 get cytotec
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Recently uploaded (20)

Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

Dask: Scaling Python

  • 1. © 2017 Anaconda, Inc. - Confidential & Proprietary Dask: Scaling Python Matthew Rocklin @mrocklin
  • 2. © 2017 Anaconda, Inc. - Confidential & Proprietary Python is large and growing
  • 3. © 2017 Anaconda, Inc. - Confidential & Proprietary https://stackoverflow.blog/2017/09/06/incredible-growth-python/ https://stackoverflow.blog/2017/09/14/python-growing-quickly/
  • 10. © 2017 Anaconda, Inc. - Confidential & Proprietary Numeric Python’s virtues and vices • Fast: Native code with C/C++/CUDA • Intuitive: Long history with scientists and analysts • Established: Trusted and well understood • Broad: Packages for everything, community supported • But wasn’t designed to scale: • Limited to a single thread • Limited to in-memory data
  • 11. © 2017 Anaconda, Inc. - Confidential & Proprietary How do we scale an ecosystem? From a parallel computing perspective
  • 12. © 2017 Anaconda, Inc. - Confidential & Proprietary • Designed to parallelize the Python ecosystem • Flexible parallel computing paradigm • Familiar APIs for Python users • Co-developed with Pandas/SKLearn/Jupyter teams • Scales • Scales from multicore to 1000-node clusters • Resilience, responsive, and real-time
  • 13. © 2017 Anaconda, Inc. - Confidential & Proprietary • High Level: Parallel NumPy, Pandas, ML • Satisfies subset of these APIs • Uses these libraries internally • Co-developed with these teams • Low Level: Task scheduling for arbitrary execution • Parallelize existing code • Build novel real-time systems • Arbitrary task graphs with data dependencies • Same scalability
  • 14. © 2017 Anaconda, Inc. - Confidential & Proprietary demo • High level: Scaling Pandas • Same Pandas look and feel • Uses Pandas under the hood • Scales nicely onto many machines • Low level: Arbitrary task scheduling • Parallelize normal Python code • Build custom algorithms • React real-time • Demo deployed with • dask-kubernetes Google Compute Engine • github.com/dask/dask-kubernetes • Youtube link • https://www.youtube.com/watch?v=o ds97a5Pzw0&
  • 15. What makes Dask different?
  • 16. Most parallel frameworks follow this architecture
      1. High-level user-facing API, like the SQL language or linear algebra
      2. Medium-level query plan
          For databases/Spark: big-data map steps, shuffle steps, and aggregation steps
          For arrays: matrix multiplies, transposes, slicing
      3. Low-level task graph
          Read a 100MB chunk of data, run a black-box function on it
      4. Execution system
          Run task 9352 on worker 32, move data x-123 to worker 26
      Flow from higher- to lower-level abstractions
  • 17. Most Parallel Framework Architectures
      User API → High Level Representation → Logical Plan → Low Level Representation → Physical Plan → Task scheduler for execution
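The bottom two layers (a low-level task graph handed to an execution system) can be sketched in plain Python. This is a toy illustration only, not Dask's scheduler: the graph layout is loosely modeled on a dictionary that maps keys to either literal values or `(function, *arguments)` tuples, and the `get` helper and the `inc` function are hypothetical names introduced here.

```python
from operator import add


def inc(x):
    return x + 1


# Hypothetical task graph: each key maps to a literal value or to a
# (function, *arguments) tuple. A string argument that names another
# key is a data dependency on that key's result.
graph = {
    "x": 1,
    "y": (inc, "x"),
    "z": (add, "y", 10),
}


def get(graph, key, cache=None):
    """Evaluate `key` recursively, computing each dependency only once."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        values = [
            get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*values)
    else:
        result = task  # a literal value, nothing to run
    cache[key] = result
    return result


result = get(graph, "z")  # inc(1) -> 2, then add(2, 10) -> 12
```

A real execution system would additionally decide which worker runs each task and when data moves between workers; this sketch only resolves dependencies in a single thread.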
  • 18. SQL Database Architecture
        SELECT avg(value)
        FROM accounts
        INNER JOIN customers ON …
        WHERE name = 'Alice'
  • 19. SQL Database Architecture
        SELECT avg(value)
        FROM accounts
        WHERE name = 'Alice'
        INNER JOIN customers ON …
      Optimize: the WHERE filter is applied before the join
  • 20. Spark Architecture
        df.join(df2, …).select(…).filter(…)
      Optimize
  • 21. Large Matrix Architecture (A’ * A) A’ * b Optimize
  • 22. Dask Architecture
        accts = dd.read_parquet(…)
        accts = accts[accts.name == 'Alice']
        df = dd.merge(accts, customers)
        df.value.mean().compute()
      Dask doesn’t have a high-level abstraction
      Dask can’t optimize
      But Dask is general to many domains
  • 23. Dask Architecture
        u, s, v = da.linalg.svd(X)
        Y = u.dot(da.diag(s)).dot(v)
        da.linalg.norm(X - Y)
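The SVD round trip can be checked on a small array, assuming `dask[array]` and NumPy are installed. The array shape, chunk sizes, and random data below are arbitrary choices for this sketch. It relies on `da.linalg.svd` following the NumPy convention of returning the already-transposed right factor, and on the input being tall and skinny with a single column of chunks.

```python
import numpy as np
import dask.array as da

# Arbitrary tall-and-skinny test matrix, split into four row blocks.
rng = np.random.default_rng(0)
X = da.from_array(rng.standard_normal((1000, 20)), chunks=(250, 20))

# v has shape (20, 20) and is already the transposed right factor,
# so the reconstruction is u @ diag(s) @ v.
u, s, v = da.linalg.svd(X)
Y = u.dot(da.diag(s)).dot(v)

# Reconstruction error (Frobenius norm); it should be near machine precision.
err = float(da.linalg.norm(X - Y).compute())
```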
  • 24. Dask Architecture
        for i in range(256):
            x = dask.delayed(f)(i)
            y = dask.delayed(g)(x)
            z = dask.delayed(add)(x, y)
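The slide leaves `f`, `g`, and `add` undefined. A self-contained sketch of the same loop, assuming Dask is installed and using placeholder function bodies invented for illustration, might look like this:

```python
import dask


# Placeholder functions; the slide does not say what f, g, and add do.
def f(i):
    return i + 1


def g(x):
    return 2 * x


def add(x, y):
    return x + y


zs = []
for i in range(256):
    x = dask.delayed(f)(i)      # builds a task, runs nothing yet
    y = dask.delayed(g)(x)      # depends on x
    z = dask.delayed(add)(x, y)  # depends on both x and y
    zs.append(z)

# Combine all 256 results in one more task, then execute the whole graph.
total = dask.delayed(sum)(zs)
result = total.compute()
```

Each `z` equals `3 * (i + 1)`, so the total is `3 * (256 * 257 / 2) = 98688`.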
  • 25. Dask Architecture
        async def func():
            client = await Client(asynchronous=True)
            futures = client.map(…)
            async for f in as_completed(…):
                result = await f
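A runnable variant of the asynchronous pattern, assuming `dask.distributed` is installed. The `square` function and the input range are hypothetical. The sketch starts a small in-process cluster (`processes=False`) with the dashboard disabled so that it needs no external scheduler.

```python
import asyncio

from dask.distributed import Client, as_completed


def square(x):
    # Hypothetical workload standing in for the slide's elided function.
    return x * x


async def func():
    # asynchronous=True makes the client usable with async/await.
    async with Client(
        asynchronous=True, processes=False, dashboard_address=None
    ) as client:
        futures = client.map(square, range(10))
        results = []
        # Futures are yielded in completion order, not submission order.
        async for future in as_completed(futures):
            results.append(await future)
    return sorted(results)


results = asyncio.run(func())
```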
  • 26. Dask Architecture Your own system here
  • 27. High-level representations are powerful But they also box you in
  • 28. [Figure: a Spark diagram labeled Map stage, Shuffle stage, Reduce stage, shown next to a Dask diagram]
  • 29. [Figure: Dask and Spark diagrams side by side; Spark's is labeled Map stage, Shuffle stage, Reduce stage]
  • 30. By dropping the high-level representation
      Costs
          • Lose specialization
          • Lose opportunities for high-level optimization
      Benefits
          • Become generalists
          • More flexibility for new domains and algorithms
          • Access to smarter algorithms
          • Better task scheduling: resource constraints, GPUs, multiple clients, async real-time, etc.
  • 31. Ten Reasons People Choose Dask
  • 32. 1. Scalable Pandas DataFrames
      • Same API
            import dask.dataframe as dd
            df = dd.read_parquet('s3://bucket/accounts/2017')
            df.groupby(df.name).value.mean().compute()
      • Efficient Timeseries Operations
            # Use the pandas index for efficient operations
            df.loc['2017-01-01']
            df.value.rolling(10).std()
            df.value.resample('10m').mean()
      • Co-developed with Pandas and by the Pandas developer community
  • 33. 2. Scalable NumPy Arrays
      • Same API
            import dask.array as da
            x = da.from_array(my_hdf5_file)
            y = x.dot(x.T)
      • Applications
          • Atmospheric science
          • Satellite imagery
          • Biomedical imagery
      • Optimization algorithms: check out dask-glm
  • 34. 3. Parallelize Scikit-Learn/Joblib
      • Scikit-Learn parallelizes with Joblib
            estimator = RandomForest(…, n_jobs=8)
            estimator.fit(train_data, train_labels)
      • Joblib can use Dask
            from sklearn.externals.joblib import parallel_backend
            with parallel_backend('dask', scheduler='…'):
                estimator.fit(train_data, train_labels)
      https://pythonhosted.org/joblib/
      http://distributed.readthedocs.io/en/latest/joblib.html
      Joblib backend: thread pool
  • 35. 3. Parallelize Scikit-Learn/Joblib (same content as slide 34, with the Joblib backend switched from the thread pool to Dask)
  • 36. 4. Parallelize Existing Codebases
      • Parallelize custom code with minimal intrusion
            results = []
            for x in X:
                for y in Y:
                    if x < y:
                        result = f(x, y)
                    else:
                        result = g(x, y)
                    results.append(result)
      • Good for algorithm researchers
      • Good for enterprises with entrenched business logic
      M Tepper, G Sapiro, “Compressed nonnegative matrix factorization is fast and accurate”, IEEE Transactions on Signal Processing, 2016
  • 37. 4. Parallelize Existing Codebases
      • Parallelize custom code with minimal intrusion
            f = dask.delayed(f)
            g = dask.delayed(g)
            results = []
            for x in X:
                for y in Y:
                    if x < y:
                        result = f(x, y)
                    else:
                        result = g(x, y)
                    results.append(result)
            results = dask.compute(*results)
      • Good for algorithm researchers
      • Good for enterprises with entrenched business logic
      M Tepper, G Sapiro, “Compressed nonnegative matrix factorization is fast and accurate”, IEEE Transactions on Signal Processing, 2016
  • 38. 5. Many Other Libraries in Anaconda
      • Scikit-Image uses Dask to break down images and accelerate algorithms with overlapping regions
      • Geopandas can scale with Dask (new work)
          • Spatial partitioning
          • Accelerate spatial joins
  • 39. 6. Dask Scales Up
      • Thousand-node clusters
      • Cloud computing
      • Supercomputers
      • Gigabyte/s bandwidth
      • 200 microsecond task overhead
      Dask Scales Down (the median cluster size is one)
      • Can run in a single Python thread pool
      • Almost no performance penalty (microseconds)
      • Lightweight
      • Few dependencies
      • Easy install
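Scaling down can be as simple as choosing a local scheduler at compute time. The sketch below, assuming Dask is installed, runs one graph on both the single-threaded and the thread-pool scheduler; the `square` function is a hypothetical workload chosen for illustration.

```python
import dask


@dask.delayed
def square(x):
    # Hypothetical workload; any pure function would do.
    return x * x


# One graph: ten independent tasks feeding a final sum.
total = dask.delayed(sum)([square(i) for i in range(10)])

# The same graph runs unchanged on different local schedulers.
serial = total.compute(scheduler="synchronous")  # single thread, easy to debug
threaded = total.compute(scheduler="threads")    # local thread pool
```

Both calls return the same value; only the execution strategy differs.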
  • 40. 7. Parallelize Web Backends
      • Web servers process thousands of small computations asynchronously for web pages or REST endpoints
      • Dask provides dynamic, heterogeneous computation
          • Supports small data
          • 10ms roundtrip times
          • Dynamic scaling for different loads
      • Supports asynchronous Python (like GoLang)
            async def serve(request):
                future = dask_client.submit(process, request)
                result = await future
                return result
  • 41. 8. Debugging support
      • Clean Python tracebacks when user code breaks
      • Connect to remote workers with IPython sessions for advanced debugging
  • 42. 9. Resource constraints
      • Define limited hardware resources for workers
      • Specify resource constraints when submitting tasks
            $ dask-worker … --resources GPU=2
            $ dask-worker … --resources GPU=2
            $ dask-worker … --resources special-db=1
            dask.compute(…, resources={x: {'GPU': 1}, read: {'special-db': 1}})
      • Used for GPUs, big-memory machines, special hardware, database connections, I/O machines, etc.
  • 43. 10. Beautiful Diagnostic Dashboards
      • Fast responsive dashboards
      • Provide users performance insight
      • Powered by Bokeh
  • 44. Some Reasons not to Choose Dask
  • 45. Dask’s limitations
      • Dask is not a SQL database. It does Pandas well, but won’t optimize complex queries
      • Dask is not a JVM technology. It’s a Python library (although Julia bindings are available)
      • Dask is not a monolithic framework. You’ll have to install Pandas, SKLearn and others as well. Dask is small, designed to complement existing systems
      • Parallelism is not always necessary. Use simple solutions if feasible
  • 46. Why do people choose Dask?
      • Familiar with Python:
          • Drop-in NumPy/Pandas/SKLearn APIs
          • Native memory environment
          • Easy debugging and diagnostics
      • Have complex problems:
          • Parallelize existing code without expensive rewrites
          • Sophisticated algorithms and systems
          • Real-time response to small data
      • Scales up and down:
          • Scales to 1000-node clusters
          • Also runs cheaply on a laptop
            #import pandas as pd
            import dask.dataframe as dd
  • 47. Thank you for your time Questions?
  • 48. dask.pydata.org conda install dask