Dask and Machine Learning Models in Production
William Cox
PyColorado 2019
@gallamine
Background - Me
● William Cox
● North Carolina
○ twitter.com/gallamine
○ gallamine.com
● Building machine learning systems at Grubhub
○ Part of the Delivery team that delivers food around the country
○ Previously - Internet security industry and sonar systems
#2
@gallamine
Background - Grubhub
Grubhub Inc. is an American online and mobile food ordering and delivery
marketplace that connects diners with local takeout restaurants*.
#3
https://en.wikipedia.org/wiki/Grubhub
@gallamine
The Problem We’re Solving
● Every week we schedule drivers for timeslots.
● Too few drivers, and diners are unhappy because they can’t get delivery
● Too many drivers
○ Drivers are unhappy because they’re idle and paid a base rate
○ Grubhub is unhappy because they’re paying for idle drivers
● We predict how many orders will happen for all regions so that an
appropriate number of drivers can be scheduled.
● My team designs and runs the prediction systems for Order Volume Forecasting
#4
@gallamine
Daily Prediction Cycle
#5
[Diagram: historic order data, weather, sports, … feed into model training, which predicts orders N weeks into the future]
@gallamine
How Do We Parallelize the Work?
● Long-term forecasting is a batch job (can take several hours to predict 3
weeks into the future)
● Creating multi-week predictions, for hundreds of different regions, for many
different models
● Need a system to do this in parallel across many machines
#6
[Diagram: Model 1 through Model N, each forecasting Region 1 through Region M in parallel]
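A minimal sketch of that fan-out, with placeholder model/region ids and a stand-in forecast function:

    from dask.distributed import Client, as_completed

    def forecast(model, region):
        return f"{model}:{region}"      # stand-in for a real model run

    models = ["model_a", "model_b"]     # hypothetical model ids
    regions = ["region_1", "region_2"]  # hypothetical region ids

    if __name__ == "__main__":
        client = Client()
        futures = [client.submit(forecast, m, r)  # one task per (model, region) pair
                   for m in models for r in regions]
        for future in as_completed(futures):
            print(future.result())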
@gallamine
Design Goals
● Prefer Python(ic)
● Prefer simplicity
● Prefer local testing / distributed deployment
● Prefer minimal changes to existing (largish) codebase (that I was unfamiliar
with)
Our problem needs heavy compute but not necessarily heavy data. Most of our
data will fit comfortably in memory.
#7
@gallamine
The Contenders
#8
@gallamine
Dask
● Familiar API
● Scales out to clusters
● Scales down to single computers

“Dask’s ability to write down arbitrary computational graphs Celery/Luigi/Airflow-style and yet run them with the scalability promises of Hadoop/Spark allows for a pleasant freedom to write comfortably and yet still compute scalably.” M. Rocklin, creator
#9
Dask provides ways to scale Pandas, Scikit-Learn, and
Numpy workflows with minimal rewriting.
● Integrates with the Python ecosystem
● Supports complex applications
● Responsive feedback
@gallamine
Dask
Dask use cases can be roughly divided into the following two categories:
1. Large NumPy/Pandas/Lists with dask.array, dask.dataframe, dask.bag to
analyze large datasets with familiar techniques. This is similar to
Databases, Spark, or big array libraries.
2. Custom task scheduling. You submit a graph of functions that depend on
each other for custom workloads. This is similar to Azkaban, Airflow, Celery,
or Makefiles
#10
https://docs.dask.org/en/latest/use-cases.html
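A toy example of the second category, building a small dependency graph with dask.delayed (function bodies are stand-ins):

    import dask

    @dask.delayed
    def load(path):
        return [path]           # stand-in for reading data

    @dask.delayed
    def clean(data):
        return data             # stand-in for cleaning

    @dask.delayed
    def combine(parts):
        return sum(parts, [])   # stand-in for merging

    # Nothing runs until .compute(); Dask schedules the whole graph.
    parts = [clean(load(p)) for p in ["a.csv", "b.csv"]]
    result = combine(parts).compute()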
@gallamine
Dask Quickstart
> pip install dask
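The distributed scheduler used in the examples below ships as a separate package; the extra pulls it in:

> pip install "dask[distributed]"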
#11
@gallamine
Dask Quickstart
def _forecast(group_name, static_param):
    if group_name == "c":
        raise ValueError("Bad group.")
    # do work here (simulated by sleeping)
    sleep_time = 1 + random.randint(1, 10)
    time.sleep(sleep_time)
    return sleep_time
#12
@gallamine
#13
from dask.distributed import Client, as_completed
import time
import random

if __name__ == "__main__":
    client = Client()
    predictions = []
    for group in ["a", "b", "c", "d"]:
        static_parameters = 1
        fcast_future = client.submit(_forecast, group, static_parameters, pure=False)
        predictions.append(fcast_future)
    for future in as_completed(predictions, with_results=False):
        try:
            print(f"future {future.key} returned {future.result()}")
        except ValueError as e:
            print(e)
“The concurrent.futures module provides a high-level interface for asynchronously executing callables.” Dask implements this interface; _forecast is the arbitrary function we’re scheduling.
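as_completed can also hand results back directly with with_results=True; note that an errored future then raises as you iterate, so the try/except moves around the loop. A small variant of the loop above:

    for future, result in as_completed(predictions, with_results=True):
        print(f"future {future.key} returned {result}")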
@gallamine
#14
@gallamine
Dask Distributed - Local
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(
    processes=USE_DASK_LOCAL_PROCESSES,
    n_workers=1,
    threads_per_worker=DASK_THREADS_PER_WORKER,
    memory_limit='auto'
)
client = Client(cluster)
cluster.scale(DASK_LOCAL_WORKER_INSTANCES)
client.submit(…)
#15
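Pointing the same Client at a remote scheduler is the only change needed later (the address here is a placeholder):

    client = Client("tcp://scheduler-host:8786")  # hypothetical scheduler address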
@gallamine
Show Dask UI Local/Cluster
#16
@gallamine
Dask Distributed on YARN
● Dask workers are started in YARN containers
● Lets you allocate compute/memory resources on a cluster
● Files are distributed via HDFS, which makes them available across the whole cluster
#17
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
Dask works nicely with Hadoop to create and
manage Dask workers.
Lets you scale Dask to many computers on a
network.
Can also do: Kubernetes, SSH, GCP …
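One way to stand this up, assuming the dask-yarn package and a packaged Python environment archive on the cluster (names and sizes are placeholders):

    from dask_yarn import YarnCluster
    from dask.distributed import Client

    # Ships environment.tar.gz to YARN containers and starts workers inside them.
    cluster = YarnCluster(environment="environment.tar.gz",
                          worker_vcores=2,
                          worker_memory="4GiB")
    cluster.scale(10)          # ask YARN for 10 worker containers
    client = Client(cluster)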
@gallamine
import skein

worker = skein.Service(
    instances=config.dask_worker_instances,
    max_restarts=10,
    resources=skein.Resources(
        memory=config.dask_worker_memory,
        vcores=config.dask_worker_vcores
    ),
    files={
        './cachedvolume': skein.File(
            source=config.volume_sqlite3_filename, type='file'
        )
    },
    env={'THEANO_FLAGS': 'base_compiledir=/tmp/.theano/',
         'WORKER_VOL_LOCATION': './cachedvolume'},
    script='dask-yarn services worker',
    depends=['dask.scheduler']
)
Programmatically describe the service
#18
@gallamine
#19
scheduler = skein.Service(
    resources=skein.Resources(
        memory=config.dask_scheduler_memory,
        vcores=config.dask_scheduler_vcores
    ),
    script='dask-yarn services scheduler'
)

spec = skein.ApplicationSpec(
    name=yarn_app_name,
    queue='default',
    services={
        'dask.worker': worker,
        'dask.scheduler': scheduler
    }
)
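With both services described, a skein client submits the spec to YARN (a sketch; error handling omitted):

    skein_client = skein.Client()
    app_id = skein_client.submit(spec)   # launch the YARN application
    app = skein_client.connect(app_id)   # ApplicationClient for the running app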
@gallamine
Distributed Code Looks Identical to Local
for gid, url, region_ids in groups:
    futures.append(cluster_client.submit(_forecast, forecast_periods,
                                         model_id, region_ids, start_time,
                                         end_time, url, testset))

for done_forecast_job in as_completed(futures, with_results=False):
    try:
        fcast_data = done_forecast_job.result()
    except Exception as error:
        pass  # Error handling …
#20
@gallamine
Worker Logging / Observation
Cluster UI URL: cluster.application_client.ui.address

if reset_loggers:
    # When workers start, the reset-logger function runs first.
    client.register_worker_callbacks(setup=init.reset_logger)
#21
Stdout and stderr logs are captured by YARN.
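init.reset_logger above is our own helper; a hypothetical sketch of what such a worker setup callback might look like:

    import logging
    import sys

    def reset_logger():
        # Hypothetical: point the root logger at stdout so YARN captures it.
        root = logging.getLogger()
        for handler in list(root.handlers):
            root.removeHandler(handler)
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        root.addHandler(handler)
        root.setLevel(logging.INFO)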
@gallamine
Helpful - Debugging Wrapper
● Wrap Dask functions so that they can be bypassed, letting you debug the code serially
● Code in Appendix slides
#22
Big ML
● SKLearn integration
● XGBoost / TensorFlow
● Works to hand off data to existing distributed workflows
from dask.distributed import Client
client = Client()  # start a local Dask client

import dask_ml.joblib
from sklearn.externals.joblib import parallel_backend

with parallel_backend('dask'):
    ...  # Your normal scikit-learn code here

Works with joblib
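For example, a grid search fans its fits out over the workers under that backend (a sketch with toy data; newer scikit-learn versions import parallel_backend from joblib directly):

    from joblib import parallel_backend
    from dask.distributed import Client
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    client = Client()  # the 'dask' backend needs a running client
    X, y = make_classification(n_samples=1000, random_state=0)
    search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=3)
    with parallel_backend('dask'):
        search.fit(X, y)  # individual fits run on Dask workers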
@gallamine
Big Data
● For large tabular data: Dask DataFrames (Pandas + Dask)
● For large numeric data: Dask Arrays (NumPy + Dask)
● For large unstructured data: Dask Bags, a “Pythonic version of the PySpark RDD”
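A few one-liners showing the shape of each collection (file paths are placeholders):

    import dask.dataframe as dd
    import dask.array as da
    import dask.bag as db

    df = dd.read_csv("orders-*.csv")      # many CSVs, one logical dataframe
    counts = df.groupby("region").size().compute()

    x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
    total = x.sum().compute()             # chunked NumPy-style reduction

    lines = db.read_text("logs/*.json")   # unstructured records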
#24
@gallamine
Takeaways
● Forecasting now scales with the number of computers in the cluster! We also saw a 50% savings in single-node compute.
● For distributing work across computers, Dask is a good place to start
investigating.
● YARN complicates matters
○ But I don’t know that something else (Kubernetes) would be better
● The Dask website has good documentation
● The Dask maintainers answer Stack Overflow questions quickly
● Dask is a complex library with lots of different abilities. This was just one use case among many.
● We’re hiring!
#25
@gallamine
Questions?
#26
@gallamine
Debugging Wrapper - Appendix
from concurrent import futures
from dask.distributed import as_completed as dask_as_completed

class DebugClient:
    def submit(self, func, *args, **kwargs):
        # Run the function immediately and wrap the outcome in a plain Future.
        f = futures.Future()
        try:
            f.set_result(self._execute_function(func, *args, **kwargs))
            return f
        except Exception as e:
            f.set_exception(e)
            return f

    def _execute_function(self, func, *args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            raise

def as_completed(fcast_futures, with_results):
    # In debug mode the futures are already resolved; just hand them back.
    if not config.dask_debug_mode:
        return dask_as_completed(fcast_futures, with_results=with_results)
    else:
        return list(fcast_futures)
#27
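A sketch of how the wrapper gets swapped in, using our config.dask_debug_mode flag:

    if config.dask_debug_mode:
        client = DebugClient()    # runs tasks serially, in-process
    else:
        client = Client(cluster)  # real distributed execution

    future = client.submit(_forecast, "a", 1)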
@gallamine
● “Dask is really just a smashing together of Python’s networking stack
with its data science stack. Most of the work was already done by the
time we got here.” - M. Rocklin
#28
https://notamonadtutorial.com/interview-with-dasks-creator-scale-your-python-from-one-computer-to-a-thousand-b4483376f200
