William Cox
PyColorado 2019
@gallamine
Background - Me
● William Cox
● North Carolina
○ twitter.com/gallamine
○ gallamine.com
● Building machine learning systems at Grubhub
○ Part of the Delivery team, which delivers food around the country
○ Previously - Internet security industry and sonar systems
Background - Grubhub
Grubhub Inc. is an American online and mobile food ordering and delivery
marketplace that connects diners with local takeout restaurants*.
https://en.wikipedia.org/wiki/Grubhub
The Problem We’re Solving
● Every week we schedule drivers for timeslots.
● Too few drivers and diners are unhappy because they can’t get delivery
● Too many drivers
○ Drivers are unhappy because they’re idle and paid a base rate
○ Grubhub is unhappy because they’re paying for idle drivers
● We predict how many orders will happen for all regions so that an
appropriate number of drivers can be scheduled.
● My team designs and runs the prediction systems for Order Volume
Forecasting
Daily Prediction Cycle
Inputs (historic order data, weather, sports, ...) → Train model → Predict orders N weeks into the future
How Do We Parallelize the Work?
● Long-term forecasting is a batch job (can take several hours to predict 3
weeks into the future)
● Creating multi-week predictions, for hundreds of different regions, for many
different models
● Need a system to do this in parallel across many machines
[Diagram: a grid of forecasting jobs, one column per model (Model 1, Model 2, Model 3, ... Model N), each covering Region 1, Region 2, ... Region M.]
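The model-by-region grid can be pictured as one independent job per (model, region) pair. The sketch below is only an illustration of that fan-out using the standard library's thread pool as a stand-in for a cluster; the model names, region names, and the `forecast` function are hypothetical placeholders, not the production code.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

MODELS = ["model_1", "model_2", "model_3"]      # hypothetical model names
REGIONS = ["region_1", "region_2", "region_3"]  # hypothetical region names


def forecast(model, region):
    """Stand-in for one independent forecasting job."""
    return f"{model}:{region}"


def run_all(models, regions, max_workers=4):
    """Run one job per (model, region) pair and return results keyed by pair."""
    jobs = list(product(models, regions))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the submitted jobs.
        results = pool.map(lambda job: forecast(*job), jobs)
        return dict(zip(jobs, results))


if __name__ == "__main__":
    out = run_all(MODELS, REGIONS)
    print(f"{len(out)} jobs finished")
```

Because no job depends on another, the number of jobs grows as models × regions, which is what makes the workload a good fit for spreading across many machines.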
Design Goals
● Prefer Python(ic)
● Prefer simplicity
● Prefer local testing / distributed deployment
● Prefer minimal changes to existing (largish) codebase (that I was unfamiliar
with)
Our problem needs heavy compute but not necessarily heavy data. Most of our
data will fit comfortably in memory.
The Contenders
Dask
● Familiar API
● Scales out to clusters
● Scales down to single computers

“Dask’s ability to write down arbitrary computational graphs Celery/Luigi/Airflow-style
and yet run them with the scalability promises of Hadoop/Spark allows for a
pleasant freedom to write comfortably and yet still compute scalably.” M. Rocklin,
creator
Dask provides ways to scale Pandas, Scikit-Learn, and
Numpy workflows with minimal rewriting.
● Integrates with the Python ecosystem
● Supports complex applications
● Responsive feedback
Dask
Dask use cases can be roughly divided in the following two categories:
1. Large NumPy/Pandas/Lists with dask.array, dask.dataframe, dask.bag to
analyze large datasets with familiar techniques. This is similar to
Databases, Spark, or big array libraries.
2. Custom task scheduling. You submit a graph of functions that depend on
each other for custom workloads. This is similar to Azkaban, Airflow, Celery,
or Makefiles
https://docs.dask.org/en/latest/use-cases.html
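To make the second category concrete, here is a toy evaluator for a "graph of functions that depend on each other". It is modeled loosely on the dict-based graph format described in the Dask documentation, where a key maps either to a value or to a tuple whose first element is a callable. The `get` function below is a hypothetical, single-threaded illustration, not Dask's scheduler, and it assumes every task argument is itself a key in the graph.

```python
def inc(x):
    return x + 1


def add(x, y):
    return x + y


# Each key maps to a literal value or to a (function, *argument_keys) tuple.
GRAPH = {
    "a": 1,
    "b": (inc, "a"),
    "c": (add, "a", "b"),
}


def get(graph, key, cache=None):
    """Compute `key`, first computing whatever it depends on (memoized)."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *arg_keys = task
        value = func(*(get(graph, k, cache) for k in arg_keys))
    else:
        value = task
    cache[key] = value
    return value


if __name__ == "__main__":
    print(get(GRAPH, "c"))
```

A real scheduler adds what this toy leaves out: running independent tasks in parallel, moving data between workers, and recovering from failures.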
Dask Quickstart
> pip install "dask[distributed]"
Dask Quickstart
def _forecast(group_name, static_param):
    if group_name == "c":
        raise ValueError("Bad group.")
    # do work here
    sleep_time = 1 + random.randint(1, 10)
    time.sleep(sleep_time)
    return sleep_time
from dask.distributed import Client, as_completed
import time
import random

if __name__ == "__main__":
    client = Client()
    predictions = []
    for group in ["a", "b", "c", "d"]:
        static_parameters = 1
        fcast_future = client.submit(_forecast, group, static_parameters, pure=False)
        predictions.append(fcast_future)
    for future in as_completed(predictions, with_results=False):
        try:
            print(f"future {future.key} returned {future.result()}")
        except ValueError as e:
            print(e)
“The concurrent.futures module provides a high-level
interface for asynchronously executing callables.” Dask implements
this interface
Arbitrary function we’re scheduling
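For comparison, the same submit / as_completed / result pattern can be written against the standard library's `concurrent.futures` directly. This is a sketch to show the shape of the interface, not a replacement for the Dask client; the `_forecast` body is shortened (no sleep) and the `run` helper is a hypothetical wrapper added for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def _forecast(group_name, static_param):
    if group_name == "c":
        raise ValueError("Bad group.")
    return f"{group_name}-{static_param}"


def run(groups):
    """Submit one task per group; collect results and error messages separately."""
    ok, failed = [], []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(_forecast, g, 1) for g in groups]
        for future in as_completed(futures):
            try:
                ok.append(future.result())
            except ValueError as e:
                # An exception raised in the task is re-raised by result().
                failed.append(str(e))
    return sorted(ok), failed


if __name__ == "__main__":
    print(run(["a", "b", "c", "d"]))
```

The loop body is the same as in the Dask version: an exception raised inside a task surfaces only when `result()` is called on its future.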
Dask Distributed - Local
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    processes=USE_DASK_LOCAL_PROCESSES,
    n_workers=1,
    threads_per_worker=DASK_THREADS_PER_WORKER,
    memory_limit='auto'
)
client = Client(cluster)
cluster.scale(DASK_LOCAL_WORKER_INSTANCES)
client.submit(…)
Show Dask UI Local/Cluster
Dask Distributed on YARN
● Dask workers are started in YARN containers
● Lets you allocate compute/memory resources on a cluster
● Files are distributed via HDFS
● HDFS lets you distribute files across a cluster
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
Dask works nicely with Hadoop to create and
manage Dask workers.
Lets you scale Dask to many computers on a
network.
Can also do: Kubernetes, SSH, GCP …
import skein

worker = skein.Service(
    instances=config.dask_worker_instances,
    max_restarts=10,
    resources=skein.Resources(
        memory=config.dask_worker_memory,
        vcores=config.dask_worker_vcores
    ),
    # Ship the cached volume file to every worker container.
    files={
        './cachedvolume': skein.File(
            source=config.volume_sqlite3_filename, type='file'
        )
    },
    env={'THEANO_FLAGS': 'base_compiledir=/tmp/.theano/',
         'WORKER_VOL_LOCATION': './cachedvolume'},
    script='dask-yarn services worker',
    # Workers start only after the scheduler service is up.
    depends=['dask.scheduler']
)
Programmatically describe the service
scheduler = skein.Service(
    resources=skein.Resources(
        memory=config.dask_scheduler_memory,
        vcores=config.dask_scheduler_vcores
    ),
    script='dask-yarn services scheduler'
)
spec = skein.ApplicationSpec(
    name=yarn_app_name,
    queue='default',
    services={
        'dask.worker': worker,
        'dask.scheduler': scheduler
    }
)
Distributed Code Looks Identical to Local
for gid, url, region_ids in groups:
    futures.append(cluster_client.submit(_forecast, forecast_periods,
                                         model_id, region_ids, start_time,
                                         end_time, url, testset))
for done_forecast_job in as_completed(futures, with_results=False):
    try:
        fcast_data = done_forecast_job.result()
    except Exception as error:
        # Error handling …
Worker Logging / Observation
Cluster UI URL: cluster.application_client.ui.address
if reset_loggers:
    # When workers start, the reset-logging function is executed first.
    client.register_worker_callbacks(setup=init.reset_logger)
Stdout and stderr logs are captured by YARN.
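The slides register `init.reset_logger` as a worker setup callback but do not show its body. A hypothetical version, assuming the goal is simply to replace whatever handlers a worker process inherited with one stdout handler (so the output lands in the logs YARN captures), might look like this. The function name matches the slide; everything inside it is an assumption.

```python
import logging
import sys


def reset_logger(level=logging.INFO):
    """Replace the root logger's handlers with a single stdout handler."""
    root = logging.getLogger()
    # Drop any handlers the process inherited or configured earlier.
    for handler in list(root.handlers):
        root.removeHandler(handler)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    root.addHandler(handler)
    root.setLevel(level)
    return root


if __name__ == "__main__":
    reset_logger()
    logging.getLogger("forecast").info("worker logging configured")
```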
Helpful - Debugging Wrapper
● Wrap Dask functions so that they can be turned off for debugging code
serially
● Code in Appendix slides
Big ML
● SKLearn integration
● XGBoost / TensorFlow
● Works to hand off data to existing
distributed workflows
from dask.distributed import Client
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly.
from joblib import parallel_backend

client = Client()  # start a local Dask client

with parallel_backend('dask'):
    # Your normal scikit-learn code here
    pass
Works with joblib
Big Data
● For dealing with large tabular data Dask has
distributed dataframes - Pandas + Dask
● For large numeric data Dask Arrays - Numpy +
Dask
● For large unstructured data Dask Bags
“Pythonic version of the PySpark RDD.”
Takeaways
● Forecasting now scales with the number of computers in the cluster! Also a 50%
savings in single-node compute.
● For distributing work across computers, Dask is a good place to start
investigating.
● YARN complicates matters
○ But I don’t know that something else (Kubernetes) would be better
○ The Dask website has good documentation
○ The Dask maintainers answer Stack Overflow questions quickly.
○ Dask is a complex library with lots of different abilities. This was just one
use case among many.
○ We’re hiring!
Questions?
Debugging Wrapper - Appendix
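from concurrent import futures

from dask.distributed import as_completed as dask_as_completed


class DebugClient:
    """Stand-in for the Dask client that runs each task serially, in-process."""

    def submit(self, func, *args, **kwargs):
        f = futures.Future()
        try:
            f.set_result(self._execute_function(func, *args, **kwargs))
        except Exception as e:
            # Store the error on the future so callers see it from result().
            f.set_exception(e)
        return f

    def _execute_function(self, func, *args, **kwargs):
        return func(*args, **kwargs)


def as_completed(fcast_futures, with_results):
    # `config` is the project's own settings object (not shown in the slides).
    if not config.dask_debug_mode:
        return dask_as_completed(fcast_futures, with_results=with_results)
    # Debug mode: the futures are already finished, so return them as-is.
    return list(fcast_futures)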
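A short usage sketch of the same idea, written so it runs without Dask or the project's `config` object. `SerialClient` is a hypothetical, trimmed-down copy of the wrapper above, and `_forecast` and `run` are placeholders; the point is only that calling code keeps the same submit / result shape whether work runs serially or on a cluster.

```python
from concurrent.futures import Future


class SerialClient:
    """Hypothetical stand-in exposing the same submit() shape as a Dask client."""

    def submit(self, func, *args, **kwargs):
        future = Future()
        try:
            future.set_result(func(*args, **kwargs))
        except Exception as error:
            future.set_exception(error)
        return future


def _forecast(group_name):
    if group_name == "c":
        raise ValueError("Bad group.")
    return group_name.upper()


def run(client, groups):
    """Same calling pattern as the distributed code: submit, then read result()."""
    done, errors = [], []
    for future in [client.submit(_forecast, g) for g in groups]:
        try:
            done.append(future.result())
        except ValueError as error:
            errors.append(str(error))
    return done, errors


if __name__ == "__main__":
    print(run(SerialClient(), ["a", "c", "d"]))
```

Since every task executes in the calling process, breakpoints and stack traces behave as they would in ordinary serial code.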
● “Dask is really just a smashing together of Python’s networking stack
with its data science stack. Most of the work was already done by the
time we got here.” - M. Rocklin
https://notamonadtutorial.com/interview-with-dasks-creator-scale-your-python-from-one-computer-to-a-thousand-b4483376f200

block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 

Dask and Machine Learning Models in Production - PyColorado 2019

  • 2. @gallamine Background - Me ● William Cox ● North Carolina ○ twitter.com/gallamine ○ gallamine.com ● Building machine learning systems at Grubhub ○ Part of the Delivery team to deliver food around the country ○ Previously - Internet security industry and sonar systems #2
  • 3. @gallamine Background - Grubhub Grubhub Inc. is an American online and mobile food ordering and delivery marketplace that connects diners with local takeout restaurants*. #3 https://en.wikipedia.org/wiki/Grubhub
  • 4. @gallamine The Problem We’re Solving ● Every week we schedule drivers for timeslots. ● Too few drivers and diners are unhappy because they can’t get delivery ● Too many drivers ○ Drivers are unhappy because they’re idle and paid a base rate ○ Grubhub is unhappy because they’re paying for idle drivers ● We predict how many orders will happen for all regions so that an appropriate number of drivers can be scheduled. ● My team designs and runs the prediction systems for Order Volume Forecasting #4
  • 5. @gallamine Daily Prediction Cycle #5 Historic order data Weather Sports ... Train model Predict orders N weeks into the future
  • 6. @gallamine How Do We Parallelize the Work? ● Long-term forecasting is a batch job (can take several hours to predict 3 weeks into the future) ● Creating multi-week predictions, for hundreds of different regions, for many different models ● Need a system to do this in parallel across many machines #6 [Diagram: Model 1, Model 2, Model 3, ... Model N, each run over Region 1, Region 2, ... Region M]
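The model-by-region grid on this slide can be sketched with the standard library alone. This is a minimal illustration, not the Grubhub system: the model and region names, the `forecast` stand-in, and the `run_grid` helper are all hypothetical, and a thread pool on one machine stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical model and region identifiers, for illustration only.
MODELS = ["model_1", "model_2", "model_3"]
REGIONS = ["region_1", "region_2", "region_3", "region_4"]


def forecast(model, region):
    """Stand-in for a real forecast; returns a fake order count."""
    return len(model) + len(region)


def run_grid(models, regions, max_workers=4):
    """Run forecast() once for every (model, region) pair, in parallel."""
    tasks = list(product(models, regions))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda pair: forecast(*pair), tasks)
        # pool.map preserves input order, so results line up with tasks.
        return dict(zip(tasks, results))


grid = run_grid(MODELS, REGIONS)
print(len(grid))  # one entry per (model, region) pair
```

Every cell of the grid is independent of the others, which is why the work spreads across workers with no coordination beyond collecting the results.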
  • 7. @gallamine Design Goals ● Prefer Python(ic) ● Prefer simplicity ● Prefer local testing / distributed deployment ● Prefer minimal changes to existing (largish) codebase (that I was unfamiliar with) Our problem needs heavy compute but not necessarily heavy data. Most of our data will fit comfortably in memory. #7
  • 9. @gallamine Dask ● Familiar API ● Scales out to clusters ● Scales down to single computers
 “Dask’s ability to write down arbitrary computational graphs Celery/Luigi/Airflow-style and yet run them with the scalability promises of Hadoop/Spark allows for a pleasant freedom to write comfortably and yet still compute scalably.” M. Rocklin, creator #9 Dask provides ways to scale Pandas, Scikit-Learn, and Numpy workflows with minimal rewriting. ● Integrates with the Python ecosystem ● Supports complex applications ● Responsive feedback
  • 10. @gallamine Dask Dask use cases can be roughly divided in the following two categories: 1. Large NumPy/Pandas/Lists with dask.array, dask.dataframe, dask.bag to analyze large datasets with familiar techniques. This is similar to Databases, Spark, or big array libraries. 2. Custom task scheduling. You submit a graph of functions that depend on each other for custom workloads. This is similar to Azkaban, Airflow, Celery, or Makefiles #10 https://docs.dask.org/en/latest/use-cases.html
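To make the second category concrete, here is a toy sketch of "a graph of functions that depend on each other". The graph is written as a plain dictionary mapping a key to a `(function, *arguments)` tuple, loosely modeled on the dictionary form Dask documents for task graphs. The `get` evaluator and the `load` helper are assumptions written for this illustration; they run serially and are not Dask's scheduler.

```python
from operator import add, mul


def load(value):
    """Hypothetical stand-in for an expensive load step."""
    return value


# key -> (function, *arguments); a string argument naming another key
# is a dependency on that task's result.
graph = {
    "a": (load, 2),
    "b": (load, 3),
    "sum": (add, "a", "b"),
    "scaled": (mul, "sum", 10),
}


def get(graph, key, cache=None):
    """Evaluate `key`, computing each dependency at most once."""
    cache = {} if cache is None else cache
    if key not in cache:
        func, *args = graph[key]
        resolved = [
            get(graph, arg, cache) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        cache[key] = func(*resolved)
    return cache[key]


result = get(graph, "scaled")
print(result)  # (2 + 3) * 10
```

A real scheduler would also run independent tasks such as "a" and "b" at the same time and on different machines; this sketch only shows how dependencies determine evaluation order.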
  • 12. @gallamine Dask Quickstart

    def _forecast(group_name, static_param):
        if group_name == "c":
            raise ValueError("Bad group.")
        # do work here
        sleep_time = 1 + random.randint(1, 10)
        time.sleep(sleep_time)
        return sleep_time
  • 13. @gallamine

    from dask.distributed import Client, as_completed
    import time
    import random

    if __name__ == "__main__":
        client = Client()
        predictions = []
        for group in ["a", "b", "c", "d"]:
            static_parameters = 1
            fcast_future = client.submit(_forecast, group, static_parameters, pure=False)
            predictions.append(fcast_future)
        for future in as_completed(predictions, with_results=False):
            try:
                print(f"future {future.key} returned {future.result()}")
            except ValueError as e:
                print(e)

    Slide annotations:
    ● “The concurrent.futures module provides a high-level interface for asynchronously executing callables.” Dask implements this interface.
    ● _forecast: arbitrary function we’re scheduling
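Because the slide notes that Dask follows the `concurrent.futures` interface, the same submit / as_completed pattern can be tried with only the standard library. The sketch below swaps in `ThreadPoolExecutor` for the Dask `Client` and drops the sleep so it finishes instantly; the `run_all` helper and the string return value are assumptions made for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def _forecast(group_name, static_param):
    """Same shape as the slide's function, minus the sleep."""
    if group_name == "c":
        raise ValueError("Bad group.")
    return f"{group_name}:{static_param}"


def run_all(groups):
    """Submit one task per group and sort the outcomes as they finish."""
    succeeded, failed = {}, {}
    with ThreadPoolExecutor(max_workers=2) as executor:
        # Map each future back to its group so failures can be attributed.
        futures = {executor.submit(_forecast, group, 1): group for group in groups}
        for future in as_completed(futures):
            group = futures[future]
            try:
                succeeded[group] = future.result()
            except ValueError as error:
                # result() re-raises the exception from the worker here.
                failed[group] = str(error)
    return succeeded, failed


succeeded, failed = run_all(["a", "b", "c", "d"])
print(sorted(succeeded), failed)
```

One failing group does not stop the others: the exception surfaces only when `result()` is called on that particular future.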
  • 15. @gallamine Dask Distributed - Local

    from dask.distributed import Client, LocalCluster

    # The upper-case names are configuration constants defined elsewhere.
    cluster = LocalCluster(
        processes=USE_DASK_LOCAL_PROCESSES,
        n_workers=1,
        threads_per_worker=DASK_THREADS_PER_WORKER,
        memory_limit='auto'
    )
    client = Client(cluster)
    cluster.scale(DASK_LOCAL_WORKER_INSTANCES)
    client.submit(…)
  • 16. @gallamine Show Dask UI Local/Cluster #16
  • 17. @gallamine Dask Distributed on YARN ● Dask workers are started in YARN containers ● Lets you allocate compute/memory resources on a cluster ● Files are distributed via HDFS ● HDFS lets you distribute files across a cluster #17 https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html Dask works nicely with Hadoop to create and manage Dask workers. Lets you scale Dask to many computers on a network. Can also do: Kubernetes, SSH, GCP …
  • 18. @gallamine Programmatically Describe Service

    import skein

    worker = skein.Service(
        instances=config.dask_worker_instances,
        max_restarts=10,
        resources=skein.Resources(
            memory=config.dask_worker_memory,
            vcores=config.dask_worker_vcores
        ),
        files={
            './cachedvolume': skein.File(
                source=config.volume_sqlite3_filename,
                type='file'
            )
        },
        env={'THEANO_FLAGS': 'base_compiledir=/tmp/.theano/',
             'WORKER_VOL_LOCATION': './cachedvolume'},
        script='dask-yarn services worker',
        depends=['dask.scheduler']
    )
  • 19. @gallamine

    scheduler = skein.Service(
        resources=skein.Resources(
            memory=config.dask_scheduler_memory,
            vcores=config.dask_scheduler_vcores
        ),
        script='dask-yarn services scheduler'
    )

    spec = skein.ApplicationSpec(
        name=yarn_app_name,
        queue='default',
        services={
            'dask.worker': worker,
            'dask.scheduler': scheduler
        }
    )
  • 20. @gallamine Distributed Code Looks Identical to Local

    futures = []
    for gid, url, region_ids in groups:
        futures.append(cluster_client.submit(_forecast, forecast_periods, model_id,
                                             region_ids, start_time, end_time, url, testset))

    for done_forecast_job in as_completed(futures, with_results=False):
        try:
            fcast_data = done_forecast_job.result()
        except Exception as error:
            # Error handling
            …
  • 21. @gallamine Worker Logging / Observation

    Cluster UI URL: cluster.application_client.ui.address

    if reset_loggers:
        # When workers start the reset logging function will be executed first.
        client.register_worker_callbacks(setup=init.reset_logger)

    Stdout and stderr logs are captured by YARN.
  • 22. @gallamine Helpful - Debugging Wrapper ● Wrap Dask functions so that they can be turned off for debugging code serially ● Code in Appendix slides #22
  • 23. Big ML ● SKLearn integration ● XGBoost / TensorFlow ● Works to hand off data to existing distributed workflows ● Works with joblib

    from dask.distributed import Client
    import joblib

    client = Client()  # start a local Dask client

    # Recent scikit-learn releases use the standalone joblib package
    # (sklearn.externals.joblib has been removed).
    with joblib.parallel_backend('dask'):
        # Your normal scikit-learn code here
        pass
  • 24. @gallamine Big Data ● For dealing with large tabular data Dask has distributed dataframes - Pandas + Dask ● For large numeric data Dask Arrays - Numpy + Dask ● For large unstructured data Dask Bags, a “Pythonic version of the PySpark RDD.” #24
  • 25. @gallamine Takeaways ● Forecasting now scales with the number of computers in the cluster! Also a 50% savings in single-node compute. ● For distributing work across computers, Dask is a good place to start investigating. ● YARN complicates matters ○ But I don’t know that something else (Kubernetes) would be better ○ The Dask website has good documentation ○ The Dask maintainers answer Stack Overflow questions quickly. ○ Dask is a complex library with lots of different abilities. This was just one use case among many. ○ We’re hiring! #25
  • 27. @gallamine Debugging Wrapper - Appendix

    from concurrent import futures
    from dask.distributed import as_completed as dask_as_completed

    class DebugClient:
        def submit(self, func, *args, **kwargs):
            # Run the function immediately and wrap the outcome in a Future,
            # so calling code can treat it like a Dask future.
            f = futures.Future()
            try:
                f.set_result(self._execute_function(func, *args, **kwargs))
                return f
            except Exception as e:
                f.set_exception(e)
                return f

        def _execute_function(self, func, *args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                raise

    def as_completed(fcast_futures, with_results):
        # config is the application's own settings module.
        if not config.dask_debug_mode:
            return dask_as_completed(fcast_futures, with_results=with_results)
        else:
            return list(fcast_futures)
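A condensed, self-contained variant of the debugging wrapper shows how it behaves when called. The class is trimmed for illustration and the `divide` function is hypothetical; the point is that `submit` runs the function right away in the calling process and still hands back a standard `Future`, so code written against futures keeps working with no cluster involved.

```python
from concurrent import futures


class DebugClient:
    """Serial stand-in for a distributed client (condensed sketch)."""

    def submit(self, func, *args, **kwargs):
        future = futures.Future()
        try:
            # Executes immediately, in-process, so breakpoints and
            # tracebacks behave as in ordinary serial code.
            future.set_result(func(*args, **kwargs))
        except Exception as error:
            future.set_exception(error)
        return future


def divide(a, b):
    """Hypothetical task used only for this demonstration."""
    return a / b


client = DebugClient()
ok = client.submit(divide, 6, 3)
bad = client.submit(divide, 1, 0)

print(ok.done(), ok.result())
print(type(bad.exception()).__name__)
```

Since both futures are already finished when `submit` returns, iterating over them as a plain list, as the slide's `as_completed` replacement does in debug mode, yields them in submission order.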
  • 28. @gallamine ● “Dask is really just a smashing together of Python’s networking stack with its data science stack. Most of the work was already done by the time we got here.” - M. Rocklin #28 https://notamonadtutorial.com/interview-with-dasks-creator-scale-your-python-from-one-computer-to-a-thousand-b4483376f200