Distributed ML with
Dask & Kubernetes
Ray Hilton, Eliiza
@rayh @EliizaAI
What is Machine Learning?
[Diagram] Traditional Software: DATA + LOGIC → COMPUTE → OUTPUT
[Diagram] Machine Learning: DATA + OUTPUT → COMPUTE (LOTS AND LOTS OF) → LOGIC
[Diagram] Learning: TRAINING DATA + LABELS/OUTPUT → COMPUTE (LOTS AND LOTS OF) → LOGIC
[Diagram] Inference: RUNTIME DATA → COMPUTE (NOT MUCH OF) → OUTPUT
[Diagram] Engineering: REQUIREMENTS → ENGINEER’S BRAIN → LOGIC
[Diagram] Data Science: BUSINESS REQUIREMENTS → DATA SCIENTIST’S BRAIN → TRAINING DATA + LABELS
[Diagram] Runtime: RUNTIME DATA → COMPUTE → OUTPUT
Make predictions based on
previous experience
What is Dask?
It’s like Spark,
but idiomatically Python
“Dask uses existing Python
APIs and data structures to
make it easy to switch
between Numpy, Pandas,
Scikit-learn to their Dask-
powered equivalents.”
[Diagram] MAP: f(df) is applied per partition as f(df1), f(df2), f(df3), f(df4), f(df5); REDUCE: the partial outputs are gathered into a single result
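A minimal sketch of this map/reduce pattern in Dask (the data-*.csv file pattern is hypothetical):

import dask.dataframe as dd

# Dask splits the dataframe into partitions (the df1..df5 above)
df = dd.read_csv("data-*.csv")

# MAP: len() runs on each partition independently, in parallel
partition_sizes = df.map_partitions(len)

# REDUCE: the partial results are combined into one value
total_rows = partition_sizes.sum().compute()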
What functions do we apply
where?
Directed
Acyclic
Graph
Basic DAG
from dask import delayed

@delayed
def add(x, y):
    return x + y

# Build the graph lazily; nothing executes yet
four = add(
    add(1, 1),
    add(1, 1)
)

# Walk the DAG and return the result (4)
four.compute()
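The same delayed object can also draw its task graph, which helps with the larger DAGs that follow (visualize() needs the optional graphviz dependency):

four.visualize()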
Complex DAG
Dask makes scaling data
operations easy*
*YMMV
Why?
● Open Source
● De facto Standard
● Proven at Scale
● Infrastructure as Configuration
● Modular & Extensible
● Efficiency
Example Architecture
[Diagram] Kubernetes schedules mixed workloads (Jupyter, Airflow, Dask, Grafana, Spark) across Node1 to Node5, with each node contributing CPU, GPU and DISK
Kubernetes makes deployment
and orchestration easy and
efficient
Dask Cluster
worker:
  image:
    repository: eliiza/dsp-dask
    tag: latest
    pullPolicy: Always
  replicas: 10
  resources:
    limits:
      cpu: 2
      memory: 6G
    requests:
      cpu: 2
      memory: 6G

scheduler:
  image:
    repository: eliiza/dsp-dask
    tag: latest
    pullPolicy: Always

jupyter:
  enabled: false

$ helm upgrade --install dask-cluster stable/dask -f config.yml
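Once the chart is deployed, a client connects to the scheduler. A sketch, assuming the stable/dask chart's usual <release>-scheduler service name and default port:

from dask.distributed import Client

# "dask-cluster" matches the Helm release name above
client = Client("dask-cluster-scheduler:8786")
print(client)  # summarises workers, cores and memory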
Demo Cluster
Demo: Dataframes
2GB dataset, times in seconds:

              Local    Cluster   Speed Up
Counts         56.23    10.46     5.38x
Market Share   50.60     9.46     5.35x

10GB dataset, times in seconds:

              Local    Cluster   Speed Up
Counts        429.69    73.74     5.83x
Market Share  382.01    64.60     5.91x
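The deck doesn't include the demo notebook, but the operations being timed look roughly like this sketch (the vehicles-*.csv files and the make column are hypothetical):

import dask.dataframe as dd

df = dd.read_csv("vehicles-*.csv")

# Counts: rows per make (a map over partitions plus a reduce)
counts = df.groupby("make").size().compute()

# Market share: each make's fraction of the total
market_share = counts / counts.sum()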
Demo: Monte Carlo Simulation
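A minimal sketch of the pattern (estimating pi rather than the demo's actual simulation): each delayed task is a map step, and the final sum is the reduce.

import random
from dask import delayed

def simulate(n):
    # Count how many of n random points land inside the unit quarter-circle
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

n_tasks, n_per_task = 100, 10_000

# MAP: each simulation task is an independent node in the DAG
partials = [delayed(simulate)(n_per_task) for _ in range(n_tasks)]

# REDUCE: aggregate the partial counts, then derive the estimate
total = delayed(sum)(partials).compute()
print(4 * total / (n_tasks * n_per_task))  # roughly 3.14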
Demo: Random Forest
Question:
How do you know which model
architecture to use?
Answer:
Try random shit until shit looks
right
Answer:
Hyperparameter Search
Demo: RandomSearch & Dask
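One common way to wire this up (a sketch, not the demo's exact code): scikit-learn's RandomizedSearchCV with joblib's dask backend, so each candidate fit runs on a worker. The dataset and parameter grid here are made up.

import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

client = Client()  # or Client("dask-cluster-scheduler:8786") against the Helm cluster

X, y = make_classification(n_samples=10_000, random_state=0)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,
    cv=3,
)

# The "dask" joblib backend fans the candidate fits out to the workers
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)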
Learnings
Why no TensorFlow Love?
XGBoost
Fewer large nodes > many small nodes
Diagnosing Graphs
What’s Next?
RAPIDS
TensorFlow
TF2 & AutoKeras:
Watch This Space
Thank You
@rayh
elz.ai/dask-ml
Questions?


Editor's Notes

  • #2 Thanks Derek & Melbourne Distributed
  • #3 So, in this talk, I’ll briefly explain: machine learning; Dask, and how it works; and Kubernetes. Then we will work through some examples (demo gods permitting). I’ll be touching on a lot of disparate areas, so I will try to keep it relatively high level, but I’m going to assume at least some passing knowledge of these areas. Feel free to ask for clarification along the way, but please save the bigger questions for the end.
  • #4 First...
  • #7 So what does this mean in practice?
  • #8 Traditionally, humans create the logic; in ML, humans curate the data and the desired output state, and the machines derive the logic. As a side note, this doesn’t remove the need for humans from the development process; it just shifts their role to one of data wrangling, curation, and modelling of expected system output. The logic that is output by the ML training process can then be used for inference.
  • #9 More generally, the difference is that an engineer turns requirements into logic, while a data scientist turns requirements into training/test data and expected output (labels). This approach can be applied to problems that are too hard for mere mortal engineers, such as object detection in images and robust reading (formerly known as OCR).
  • #10 Essentially, the power of machine learning is that it enables us to make predictions based on previous experience, without us humans having to necessarily understand the underlying relationships
  • #11 First...
  • #12 Dask is a distributed processing library for Python. It provides a pandas-compatible API to easily perform operations on massive dataframes across many nodes
  • #13 But it doesn’t support SQL, HDFS, Hive, etc.
  • #14 You don't have to completely rewrite your code or retrain to scale up.
  • #15 Imagine we have a set of pandas dataframes; you can think of them as sets of structured data, broken up by date. These dataframes could be processed by many threads or processes at once, perhaps across many machines. With appropriate partitioning, this would allow for massive concurrency. So how can we process data in parallel?
  • #16 Imagine we have some linear function f() that we want to apply to all the data; that is to say, a function that is applied per element and has no side effects or dependencies. We could send this function to each dataframe and apply, or “map”, it in parallel. Once all those functions have been applied, we can gather, or “reduce”, the results
  • #17 So, how do we work out what functions to apply? Let’s start with what a DAG is
  • #19 Directed: flows in one direction. Acyclic: it doesn’t have any loops. Graph: a general topology primitive. A directed acyclic graph (DAG) is commonly used to solve task-scheduling problems. By breaking complex tasks into a DAG, a scheduler can scale work across a cluster. Dask is a library for delayed task computation that makes use of directed graphs at its core.
  • #20 The delayed example on this slide is from https://matthewrocklin.com/blog/work/2018/02/09/credit-models-with-dask
  • #21 Here we can see a larger DAG. It’s clear that there is an opportunity for concurrency at the bottom, where operations have no or fewer dependencies. As the task nears completion, it is performing a simpler set of operations on a larger set of data, and there is less opportunity for concurrency. Ideally, we want to avoid “reducing” until as late as possible
  • #22 If you take advantage of Dask primitives (bags, arrays, dataframes, delayed functions), and keep in mind how your operation will be decomposed and distributed, you can, in some cases, achieve effectively linear scaling (see Monte Carlo)
  • #23 First...
  • #24 Google released this to the community; since then, many people have contributed work to it, or to its ecosystem. It’s becoming a de facto standard: every cloud provider has some kind of managed Kubernetes service. Kube can scale to large numbers of nodes and complex configurations. Desired infrastructure state is described in simple YAML files, and Kube attempts to satisfy that state. If Kubernetes doesn’t support something “out of the box”, it can be extended through things like CRDs/Operators, CSI, etc. Instead of deploying and managing many clusters for different purposes (EMR, storage, API/web hosting, batch jobs), we can use a single underlying cluster and make more efficient use of the resources
  • #25 We’re running Dask on Kubernetes here. This allows us to use the same underlying compute cluster for a variety of tasks such as notebooks (such as what you will see soon) and other compute (such as TensorFlow, Spark, etc)
  • #26 Node resources can be used for many purposes
  • #27 Now we get to the awesome
  • #29 This is the Helm config for deploying the Dask cluster. You can see we specify memory/CPU limits as well as the number of workers we want. The underlying cluster will autoscale to accommodate the desired compute. We also have our custom Dask image here, which has a lot of Python packages pre-installed, as well as things like CUDA drivers, etc.
  • #30 Deploying using helm is pretty simple
  • #31 10 nodes, 2 CPUs and 6GB each
  • #34 We could make this go even faster: more cores; convert from CSV to Parquet
  • #36 This shows how Dask structured the DAG. The map steps: apply lambda (i.e. run the simulation), get item, count & sum. Then the reduce steps: aggregate counts & sums, mean
  • #38 100,000 iterations, not 10,000. This took over a thousand seconds on a local low-power machine, but came down to 11s when running on a 128-core cluster (c5.4xlarge instances). The linearity fell off towards the end, as the time taken to distribute tasks and gather results took about 8 seconds
  • #39 Logistic and XGBoost
  • #43 But that doesn’t sound too good, so we use the fancier term
  • #44 Or hyperparameter optimisation. There are a number of different algorithms, but for the general case, nothing beats RandomSearch (see Patrick’s talk); no free lunch theorem
  • #45 Or hyperparameter optimisation. There are a number of different algorithms, but for the general case, nothing beats RandomSearch (see Patrick’s talk)
  • #49 TensorFlow support in Dask has been abandoned! TensorFlow is quite hard to scale, as we have to be quite explicit about how the graph scales onto multiple CPUs and GPUs. With TF2, we have distribution strategies that will make it easy to copy the graph to many nodes, process batches of training data on each node, and then combine the results.
  • #50 Dask-XGBoost is broken on Kubernetes right now. While trying to get this to work, I realised the issue is being actively discussed; the last comment was from just a few days ago. This is bleeding-edge stuff
  • #51 Running many pods on one large machine gives greater opportunity to burst and use under-utilised resources, whereas smaller nodes tend to remain under-utilised, as you can only fit a couple of pods on them
  • #52 It can be hard to understand how code maps to graphs. You have to try different approaches (see Monte Carlo)
  • #55 Matthew Rocklin, who made Dask, now works for NVIDIA. And NVIDIA have created an “open” ecosystem for doing ML on GPUs. cuDNN sounds very interesting
  • #59 We are quite heavy users of Keras & TensorFlow So...
  • #60 With TF2, we have distribution strategies that will make it easy to copy the graph to many nodes, process batches of training data on each node, and then combine the results (a minimal sketch follows these notes). With AutoKeras, we have a way of performing search across TF architectures; this is generally much easier to parallelise than the model itself. It currently uses pytorch.multiprocessing as a backend, and it seems possible to refactor this to use joblib, and thus Dask
  • #61 https://rapids.ai/index.html https://github.com/nvidia/nvidia-docker https://github.com/rapidsai/cudf
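Sketch for note #60: a TF2 distribution strategy copying the model graph across local GPUs (MirroredStrategy is one of several strategies; the model here is a made-up example):

import tensorflow as tf

# Replicates the model on each local GPU and combines gradients per batch
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(...) then trains one replica per device on each batch shard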