Dask and Machine Learning Models in Production
William Cox
PyColorado 2019
@gallamine
Background - Me
● William Cox
● North Carolina
○ twitter.com/gallamine
○ gallamine.com
● Building machine learning systems at Grubhub
○ Part of the Delivery team that delivers food around the country
○ Previously - Internet security industry and sonar systems
#2
@gallamine
Background - Grubhub
Grubhub Inc. is an American online and mobile food ordering and delivery
marketplace that connects diners with local takeout restaurants*.
#3
https://en.wikipedia.org/wiki/Grubhub
@gallamine
The Problem We’re Solving
● Every week we schedule drivers for timeslots.
● Too few drivers, and diners are unhappy because they can’t get delivery
● Too many drivers
○ Drivers are unhappy because they’re idle and paid a base rate
○ Grubhub is unhappy because they’re paying for idle drivers
● We predict how many orders will happen for all regions so that an
appropriate number of drivers can be scheduled.
● My team designs and runs the prediction systems for Order Volume Forecasting
#4
@gallamine
Daily Prediction Cycle
#5
[Diagram: historic order data, weather, sports, … feed into model training, which predicts orders N weeks into the future]
@gallamine
How Do We Parallelize the Work?
● Long-term forecasting is a batch job (can take several hours to predict 3
weeks into the future)
● Creating multi-week predictions, for hundreds of different regions, for many
different models
● Need a system to do this in parallel across many machines
#6
[Diagram: Model 1 through Model N, each forecasting Region 1 through Region M in parallel]
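A minimal sketch of that fan-out, with placeholder model/region ids and a stand-in forecast function:

    from dask.distributed import Client, as_completed

    def forecast(model, region):
        return f"{model}:{region}"      # stand-in for a real model run

    models = ["model_a", "model_b"]     # hypothetical model ids
    regions = ["region_1", "region_2"]  # hypothetical region ids

    if __name__ == "__main__":
        client = Client()
        futures = [client.submit(forecast, m, r)  # one task per (model, region) pair
                   for m in models for r in regions]
        for future in as_completed(futures):
            print(future.result())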
@gallamine
Design Goals
● Prefer Python(ic)
● Prefer simplicity
● Prefer local testing / distributed deployment
● Prefer minimal changes to existing (largish) codebase (that I was unfamiliar
with)
Our problem needs heavy compute but not necessarily heavy data. Most of our
data will fit comfortably in memory.
#7
@gallamine
The Contenders
#8
@gallamine
Dask
● Familiar API
● Scales out to clusters
● Scales down to single computers

“Dask’s ability to write down arbitrary computational graphs Celery/Luigi/Airflow-style and yet run them with the scalability promises of Hadoop/Spark allows for a pleasant freedom to write comfortably and yet still compute scalably.” M. Rocklin, creator
#9
Dask provides ways to scale Pandas, Scikit-Learn, and
Numpy workflows with minimal rewriting.
● Integrates with the Python ecosystem
● Supports complex applications
● Responsive feedback
@gallamine
Dask
Dask use cases can be roughly divided into the following two categories:
1. Large NumPy/Pandas/Lists with dask.array, dask.dataframe, dask.bag to
analyze large datasets with familiar techniques. This is similar to
Databases, Spark, or big array libraries.
2. Custom task scheduling. You submit a graph of functions that depend on
each other for custom workloads. This is similar to Azkaban, Airflow, Celery,
or Makefiles
#10
https://docs.dask.org/en/latest/use-cases.html
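A toy example of the second category, building a small dependency graph with dask.delayed (function bodies are stand-ins):

    import dask

    @dask.delayed
    def load(path):
        return [path]           # stand-in for reading data

    @dask.delayed
    def clean(data):
        return data             # stand-in for cleaning

    @dask.delayed
    def combine(parts):
        return sum(parts, [])   # stand-in for merging

    # Nothing runs until .compute(); Dask schedules the whole graph.
    parts = [clean(load(p)) for p in ["a.csv", "b.csv"]]
    result = combine(parts).compute()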
@gallamine
Dask Quickstart
> pip install dask
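The distributed scheduler used in the examples below ships as a separate package; the extra pulls it in:

> pip install "dask[distributed]"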
#11
@gallamine
Dask Quickstart
def _forecast(group_name, static_param):
    if group_name == "c":
        raise ValueError("Bad group.")
    # do work here (simulated by sleeping)
    sleep_time = 1 + random.randint(1, 10)
    time.sleep(sleep_time)
    return sleep_time
#12
@gallamine
#13
from dask.distributed import Client, as_completed
import time
import random

if __name__ == "__main__":
    client = Client()
    predictions = []
    for group in ["a", "b", "c", "d"]:
        static_parameters = 1
        fcast_future = client.submit(_forecast, group, static_parameters, pure=False)
        predictions.append(fcast_future)
    for future in as_completed(predictions, with_results=False):
        try:
            print(f"future {future.key} returned {future.result()}")
        except ValueError as e:
            print(e)
“The concurrent.futures module provides a high-level interface for asynchronously executing callables.” Dask implements this interface; _forecast is the arbitrary function we’re scheduling.
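as_completed can also hand results back directly with with_results=True; note that an errored future then raises as you iterate, so the try/except moves around the loop. A small variant of the loop above:

    for future, result in as_completed(predictions, with_results=True):
        print(f"future {future.key} returned {result}")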
@gallamine
#14
@gallamine
Dask Distributed - Local
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(
    processes=USE_DASK_LOCAL_PROCESSES,
    n_workers=1,
    threads_per_worker=DASK_THREADS_PER_WORKER,
    memory_limit='auto'
)
client = Client(cluster)
cluster.scale(DASK_LOCAL_WORKER_INSTANCES)
client.submit(…)
#15
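Pointing the same Client at a remote scheduler is the only change needed later (the address here is a placeholder):

    client = Client("tcp://scheduler-host:8786")  # hypothetical scheduler address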
@gallamine
Show Dask UI Local/Cluster
#16
@gallamine
Dask Distributed on YARN
● Dask workers are started in YARN containers
● Lets you allocate compute/memory resources on a cluster
● Files are distributed via HDFS, which makes them available across the whole cluster
#17
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
Dask works nicely with Hadoop to create and
manage Dask workers.
Lets you scale Dask to many computers on a
network.
Can also do: Kubernetes, SSH, GCP …
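One way to stand this up, assuming the dask-yarn package and a packaged Python environment archive on the cluster (names and sizes are placeholders):

    from dask_yarn import YarnCluster
    from dask.distributed import Client

    # Ships environment.tar.gz to YARN containers and starts workers inside them.
    cluster = YarnCluster(environment="environment.tar.gz",
                          worker_vcores=2,
                          worker_memory="4GiB")
    cluster.scale(10)          # ask YARN for 10 worker containers
    client = Client(cluster)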
@gallamine
import skein

worker = skein.Service(
    instances=config.dask_worker_instances,
    max_restarts=10,
    resources=skein.Resources(
        memory=config.dask_worker_memory,
        vcores=config.dask_worker_vcores
    ),
    files={
        './cachedvolume': skein.File(
            source=config.volume_sqlite3_filename, type='file'
        )
    },
    env={'THEANO_FLAGS': 'base_compiledir=/tmp/.theano/',
         'WORKER_VOL_LOCATION': './cachedvolume'},
    script='dask-yarn services worker',
    depends=['dask.scheduler']
)
Programmatically describe the service
#18
@gallamine
#19
scheduler = skein.Service(
    resources=skein.Resources(
        memory=config.dask_scheduler_memory,
        vcores=config.dask_scheduler_vcores
    ),
    script='dask-yarn services scheduler'
)

spec = skein.ApplicationSpec(
    name=yarn_app_name,
    queue='default',
    services={
        'dask.worker': worker,
        'dask.scheduler': scheduler
    }
)
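With both services described, a skein client submits the spec to YARN (a sketch; error handling omitted):

    skein_client = skein.Client()
    app_id = skein_client.submit(spec)   # launch the YARN application
    app = skein_client.connect(app_id)   # ApplicationClient for the running app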
@gallamine
Distributed Code Looks Identical to Local
for gid, url, region_ids in groups:
    futures.append(cluster_client.submit(_forecast, forecast_periods,
                                         model_id, region_ids, start_time,
                                         end_time, url, testset))

for done_forecast_job in as_completed(futures, with_results=False):
    try:
        fcast_data = done_forecast_job.result()
    except Exception as error:
        pass  # Error handling …
#20
@gallamine
Worker Logging / Observation
Cluster UI URL: cluster.application_client.ui.address

if reset_loggers:
    # When workers start, the reset-logger function runs first.
    client.register_worker_callbacks(setup=init.reset_logger)
#21
Stdout and stderr logs are captured by YARN.
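init.reset_logger above is our own helper; a hypothetical sketch of what such a worker setup callback might look like:

    import logging
    import sys

    def reset_logger():
        # Hypothetical: point the root logger at stdout so YARN captures it.
        root = logging.getLogger()
        for handler in list(root.handlers):
            root.removeHandler(handler)
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        root.addHandler(handler)
        root.setLevel(logging.INFO)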
@gallamine
Helpful - Debugging Wrapper
● Wrap Dask functions so that they can be bypassed, letting you debug the code serially
● Code in Appendix slides
#22
Big ML
● SKLearn integration
● XGBoost / TensorFlow
● Works to hand off data to existing distributed workflows
from dask.distributed import Client
client = Client()  # start a local Dask client

import dask_ml.joblib
from sklearn.externals.joblib import parallel_backend

with parallel_backend('dask'):
    ...  # Your normal scikit-learn code here

Works with joblib
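For example, a grid search fans its fits out over the workers under that backend (a sketch with toy data; newer scikit-learn versions import parallel_backend from joblib directly):

    from joblib import parallel_backend
    from dask.distributed import Client
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    client = Client()  # the 'dask' backend needs a running client
    X, y = make_classification(n_samples=1000, random_state=0)
    search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=3)
    with parallel_backend('dask'):
        search.fit(X, y)  # individual fits run on Dask workers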
@gallamine
Big Data
● For large tabular data: Dask DataFrames (Pandas + Dask)
● For large numeric data: Dask Arrays (NumPy + Dask)
● For large unstructured data: Dask Bags, a “Pythonic version of the PySpark RDD”
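A few one-liners showing the shape of each collection (file paths are placeholders):

    import dask.dataframe as dd
    import dask.array as da
    import dask.bag as db

    df = dd.read_csv("orders-*.csv")      # many CSVs, one logical dataframe
    counts = df.groupby("region").size().compute()

    x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
    total = x.sum().compute()             # chunked NumPy-style reduction

    lines = db.read_text("logs/*.json")   # unstructured records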
#24
@gallamine
Takeaways
● Forecasting now scales with the number of computers in the cluster! We also saw a 50% savings in single-node compute.
● For distributing work across computers, Dask is a good place to start
investigating.
● YARN complicates matters
○ But I don’t know that something else (Kubernetes) would be better
● The Dask website has good documentation
● The Dask maintainers answer Stack Overflow questions quickly
● Dask is a complex library with lots of different abilities. This was just one use case among many.
● We’re hiring!
#25
@gallamine
Questions?
#26
@gallamine
Debugging Wrapper - Appendix
from concurrent import futures
from dask.distributed import as_completed as dask_as_completed

class DebugClient:
    def submit(self, func, *args, **kwargs):
        # Run the function immediately and wrap the outcome in a plain Future.
        f = futures.Future()
        try:
            f.set_result(self._execute_function(func, *args, **kwargs))
            return f
        except Exception as e:
            f.set_exception(e)
            return f

    def _execute_function(self, func, *args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            raise

def as_completed(fcast_futures, with_results):
    # In debug mode the futures are already resolved; just hand them back.
    if not config.dask_debug_mode:
        return dask_as_completed(fcast_futures, with_results=with_results)
    else:
        return list(fcast_futures)
#27
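A sketch of how the wrapper gets swapped in, using our config.dask_debug_mode flag:

    if config.dask_debug_mode:
        client = DebugClient()    # runs tasks serially, in-process
    else:
        client = Client(cluster)  # real distributed execution

    future = client.submit(_forecast, "a", 1)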
@gallamine
● “Dask is really just a smashing together of Python’s networking stack
with its data science stack. Most of the work was already done by the
time we got here.” - M. Rocklin
#28
https://notamonadtutorial.com/interview-with-dasks-creator-scale-your-python-from-one-computer-to-a-thousand-b4483376f200
