Distributed ML with
Dask & Kubernetes
Ray Hilton, Eliiza
@rayh @EliizaAI
What is Machine Learning?
[Diagram] Traditional Software: DATA + LOGIC → COMPUTE → OUTPUT
[Diagram] Machine Learning: DATA + OUTPUT → COMPUTE (LOTS AND LOTS OF) → LOGIC
[Diagram] Learning: TRAINING DATA + LABELS/OUTPUT → COMPUTE (LOTS AND LOTS OF) → LOGIC
[Diagram] Inference: RUNTIME DATA → COMPUTE (NOT MUCH OF) → OUTPUT
[Diagram] Engineering: REQUIREMENTS → ENGINEER’S BRAIN → LOGIC
[Diagram] Data Science: BUSINESS REQUIREMENTS → DATA SCIENTIST’S BRAIN → TRAINING DATA + LABELS
[Diagram] Runtime: RUNTIME DATA → COMPUTE → OUTPUT
Make predictions based on
previous experience
What is Dask?
It’s like Spark,
but idiomatically Python
“Dask uses existing Python
APIs and data structures to
make it easy to switch
between Numpy, Pandas,
Scikit-learn to their Dask-
powered equivalents.”
[Diagram] MAP: f(df) is applied per partition as f(df1), f(df2), f(df3), f(df4), f(df5); REDUCE: the partial outputs are gathered into a single result
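A minimal sketch of this map/reduce pattern in Dask (the data-*.csv file pattern is hypothetical):

import dask.dataframe as dd

# Dask splits the dataframe into partitions (the df1..df5 above)
df = dd.read_csv("data-*.csv")

# MAP: len() runs on each partition independently, in parallel
partition_sizes = df.map_partitions(len)

# REDUCE: the partial results are combined into one value
total_rows = partition_sizes.sum().compute()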
What functions do we apply
where?
Directed
Acyclic
Graph
Basic DAG
from dask import delayed

@delayed
def add(x, y):
    return x + y

# Build the graph lazily; nothing executes yet
four = add(
    add(1, 1),
    add(1, 1)
)

# Walk the DAG and return the result (4)
four.compute()
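The same delayed object can also draw its task graph, which helps with the larger DAGs that follow (visualize() needs the optional graphviz dependency):

four.visualize()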
Complex DAG
Dask makes scaling data
operations easy*
*YMMV
Why?
● Open Source
● De facto Standard
● Proven at Scale
● Infrastructure as Configuration
● Modular & Extensible
● Efficiency
Example Architecture
[Diagram] Kubernetes schedules mixed workloads (Jupyter, Airflow, Dask, Grafana, Spark) across Node1 to Node5, with each node contributing CPU, GPU and DISK
Kubernetes makes deployment
and orchestration easy and
efficient
Dask Cluster
worker:
  image:
    repository: eliiza/dsp-dask
    tag: latest
    pullPolicy: Always
  replicas: 10
  resources:
    limits:
      cpu: 2
      memory: 6G
    requests:
      cpu: 2
      memory: 6G

scheduler:
  image:
    repository: eliiza/dsp-dask
    tag: latest
    pullPolicy: Always

jupyter:
  enabled: false

$ helm upgrade --install dask-cluster stable/dask -f config.yml
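Once the chart is deployed, a client connects to the scheduler. A sketch, assuming the stable/dask chart's usual <release>-scheduler service name and default port:

from dask.distributed import Client

# "dask-cluster" matches the Helm release name above
client = Client("dask-cluster-scheduler:8786")
print(client)  # summarises workers, cores and memory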
Demo Cluster
Demo: Dataframes
2GB dataset, times in seconds:

              Local    Cluster   Speed Up
Counts         56.23    10.46     5.38x
Market Share   50.60     9.46     5.35x

10GB dataset, times in seconds:

              Local    Cluster   Speed Up
Counts        429.69    73.74     5.83x
Market Share  382.01    64.60     5.91x
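The deck doesn't include the demo notebook, but the operations being timed look roughly like this sketch (the vehicles-*.csv files and the make column are hypothetical):

import dask.dataframe as dd

df = dd.read_csv("vehicles-*.csv")

# Counts: rows per make (a map over partitions plus a reduce)
counts = df.groupby("make").size().compute()

# Market share: each make's fraction of the total
market_share = counts / counts.sum()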
Demo: Monte Carlo Simulation
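A minimal sketch of the pattern (estimating pi rather than the demo's actual simulation): each delayed task is a map step, and the final sum is the reduce.

import random
from dask import delayed

def simulate(n):
    # Count how many of n random points land inside the unit quarter-circle
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

n_tasks, n_per_task = 100, 10_000

# MAP: each simulation task is an independent node in the DAG
partials = [delayed(simulate)(n_per_task) for _ in range(n_tasks)]

# REDUCE: aggregate the partial counts, then derive the estimate
total = delayed(sum)(partials).compute()
print(4 * total / (n_tasks * n_per_task))  # roughly 3.14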
Demo: Random Forest
Question:
How do you know which model
architecture to use?
Answer:
Try random shit until shit looks
right
Answer:
Hyperparameter Search
Demo: RandomSearch & Dask
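One common way to wire this up (a sketch, not the demo's exact code): scikit-learn's RandomizedSearchCV with joblib's dask backend, so each candidate fit runs on a worker. The dataset and parameter grid here are made up.

import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

client = Client()  # or Client("dask-cluster-scheduler:8786") against the Helm cluster

X, y = make_classification(n_samples=10_000, random_state=0)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,
    cv=3,
)

# The "dask" joblib backend fans the candidate fits out to the workers
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)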
Learnings
Why no TensorFlow Love?
XGBoost
Fewer large nodes > many small nodes
Diagnosing Graphs
What’s Next?
RAPIDS
TensorFlow
TF2 & AutoKeras:
Watch This Space
Thank You
@rayh
elz.ai/dask-ml
Questions?


Editor's Notes

  • #2 Thanks Derek & Melbourne Distributed
  • #3 So, in this talk, I’ll briefly explain: machine learning; Dask, and how it works; and Kubernetes. Then we will work through some examples (demo gods permitting). I’ll be touching on a lot of disparate areas, so I will try to keep it relatively high level, but I’m going to assume at least some passing knowledge of these areas. Feel free to ask for clarification along the way, but please save the bigger questions for the end.
  • #4 First...
  • #7 So what does this mean in practice?
  • #8 Traditionally, humans create the logic; in ML, humans curate the data and the desired output state, and the machines derive the logic. As a side note, this doesn’t remove the need for humans from the development process; it just shifts their role to one of data wrangling, curation, and modelling of expected system output. The logic that is output by the ML training process can then be used for inference.
  • #9 More generally, the difference is that an engineer turns requirements into logic, while a data scientist turns requirements into training/test data and expected output (labels). This approach can be applied to problems that are too hard for mere mortal engineers, such as object detection in images and robust reading (formerly known as OCR).
  • #10 Essentially, the power of machine learning is that it enables us to make predictions based on previous experience, without us humans having to necessarily understand the underlying relationships
  • #11 First...
  • #12 Dask is a distributed processing library for Python. It provides a pandas-compatible API to easily perform operations on massive dataframes across many nodes
  • #13 But it doesn’t support SQL, HDFS, Hive, etc.
  • #14 You don't have to completely rewrite your code or retrain to scale up.
  • #15 Imagine we have a set of pandas dataframes; you can think of them as sets of structured data, broken up by date. These dataframes could be processed by many threads or processes at once, perhaps across many machines. With appropriate partitioning, this would allow for massive concurrency. So how can we process data in parallel?
  • #16 Imagine we have some linear function f() that we want to apply to all the data; that is to say, a function that is applied per element and has no side effects or dependencies. We could send this function to each dataframe and apply, or “map”, it in parallel. Once all those functions have been applied, we can gather, or “reduce”, the results
  • #17 So, how do we work out what functions to apply? Let’s start with what a DAG is
  • #19 Directed: flows in one direction. Acyclic: it doesn’t have any loops. Graph: a general topology primitive. A directed acyclic graph (DAG) is commonly used to solve task-scheduling problems. By breaking complex tasks into a DAG, a scheduler can scale work across a cluster. Dask is a library for delayed task computation that makes use of directed graphs at its core.
  • #20 The delayed example on this slide is from https://matthewrocklin.com/blog/work/2018/02/09/credit-models-with-dask
  • #21 Here we can see a larger DAG. It’s clear that there is an opportunity for concurrency at the bottom, where operations have no or fewer dependencies. As the task nears completion, it is performing a simpler set of operations on a larger set of data, and there is less opportunity for concurrency. Ideally, we want to avoid “reducing” until as late as possible
  • #22 If you take advantage of Dask primitives (bags, arrays, dataframes, delayed functions), and keep in mind how your operation will be decomposed and distributed, you can, in some cases, achieve effectively linear scaling (see Monte Carlo)
  • #23 First...
  • #24 Google released this to the community; since then, many people have contributed work to it, or to its ecosystem. It’s becoming a de facto standard: every cloud provider has some kind of managed Kubernetes service. Kube can scale to large numbers of nodes and complex configurations. Desired infrastructure state is described in simple YAML files, and Kube attempts to satisfy that state. If Kubernetes doesn’t support something “out of the box”, it can be extended through things like CRDs/Operators, CSI, etc. Instead of deploying and managing many clusters for different purposes (EMR, storage, API/web hosting, batch jobs), we can use a single underlying cluster and make more efficient use of the resources
  • #25 We’re running Dask on Kubernetes here. This allows us to use the same underlying compute cluster for a variety of tasks such as notebooks (such as what you will see soon) and other compute (such as TensorFlow, Spark, etc)
  • #26 Node resources can be used for many purposes
  • #27 Now we get to the awesome
  • #29 This is the Helm config for deploying the Dask cluster. You can see we specify memory/CPU limits as well as the number of workers we want. The underlying cluster will autoscale to accommodate the desired compute. We also have our custom Dask image here, which has a lot of Python packages pre-installed, as well as things like CUDA drivers, etc.
  • #30 Deploying using helm is pretty simple
  • #31 10 nodes, 2 CPUs and 6GB each
  • #34 We could make this go even faster: more cores; convert from CSV to Parquet
  • #36 This shows how Dask structured the DAG. The map steps: apply lambda (i.e. run the simulation), get item, count & sum. Then the reduce steps: aggregate counts & sums, mean
  • #38 100,000 iterations, not 10,000. This took over a thousand seconds on a local low-power machine, but came down to 11s when running on a 128-core cluster (c5.4xlarge instances). The linearity fell off towards the end, as the time taken to distribute tasks and gather results took about 8 seconds
  • #39 Logistic and XGBoost
  • #43 But that doesn’t sound too good, so we use the fancier term
  • #44 Or hyperparameter optimisation. There are a number of different algorithms, but for the general case, nothing beats RandomSearch (see Patrick’s talk); no free lunch theorem
  • #45 Or hyperparameter optimisation. There are a number of different algorithms, but for the general case, nothing beats RandomSearch (see Patrick’s talk)
  • #49 TensorFlow support in Dask has been abandoned! TensorFlow is quite hard to scale, as we have to be quite explicit about how the graph scales onto multiple CPUs and GPUs. With TF2, we have distribution strategies that will make it easy to copy the graph to many nodes, process batches of training data on each node, and then combine the results.
  • #50 Dask-XGBoost is broken on Kubernetes right now. While trying to get this to work, I realised the issue is being actively discussed; the last comment was from just a few days ago. This is bleeding-edge stuff
  • #51 Running many pods on one large machine gives greater opportunity to burst and use under-utilised resources, whereas smaller nodes tend to remain under-utilised, as you can only fit a couple of pods on them
  • #52 It can be hard to understand how code maps to graphs. You have to try different approaches (see Monte Carlo)
  • #55 Matthew Rocklin, who made Dask, now works for NVIDIA. And NVIDIA have created an “open” ecosystem for doing ML on GPUs. cuDNN sounds very interesting
  • #59 We are quite heavy users of Keras & TensorFlow So...
  • #60 With TF2, we have distribution strategies that will make it easy to copy the graph to many nodes, process batches of training data on each node, and then combine the results (a minimal sketch follows these notes). With AutoKeras, we have a way of performing search across TF architectures; this is generally much easier to parallelise than the model itself. It currently uses pytorch.multiprocessing as a backend, and it seems possible to refactor this to use joblib, and thus Dask
  • #61 https://rapids.ai/index.html https://github.com/nvidia/nvidia-docker https://github.com/rapidsai/cudf
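Sketch for note #60: a TF2 distribution strategy copying the model graph across local GPUs (MirroredStrategy is one of several strategies; the model here is a made-up example):

import tensorflow as tf

# Replicates the model on each local GPU and combines gradients per batch
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(...) then trains one replica per device on each batch shard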