Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS

John Zedlewski, Devin Robison
Accelerating Machine Learning
with RAPIDS and MLflow

2
Outline
● RAPIDS for accelerated data science
● Why RAPIDS + MLflow?
● Example Integration
● Training and Deployment as an MLproject

4
Open Standards Data Science Ecosystem
Traditional Python APIs on CPU
Data Preparation → Model Training → Visualization
▸ Analytics: Pandas
▸ Machine Learning: Scikit-Learn
▸ Graph Analytics: NetworkX
▸ Deep Learning: PyTorch, TensorFlow, MxNet
▸ Visualization: Matplotlib
▸ Dask
▸ CPU Memory

5
RAPIDS
End-to-End GPU Accelerated Data Science
▸ Analytics: cuDF, cuIO
▸ Machine Learning: cuML, XGBoost
▸ Graph Analytics: cuGraph
▸ Deep Learning: PyTorch, TensorFlow, MxNet
▸ Visualization: cuxfilter, pyViz, plotly
▸ Dask
▸ GPU Memory

6
RAPIDS ETL
GPU Accelerated Data Wrangling and Feature Engineering
▸ Analytics: cuDF, cuIO
▸ Machine Learning: cuML
▸ Graph Analytics: cuGraph
▸ Deep Learning: PyTorch, TensorFlow, MxNet
▸ Visualization: cuxfilter, pyViz, plotly
▸ Dask
▸ GPU Memory

7
Data Processing Evolution
Faster Data Access, Less Data Movement
▸ Hadoop Processing, Reading from Disk:
  HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
▸ Spark In-Memory Processing:
  HDFS Read → Query → ETL → ML Train
  25-100x Improvement, Less Code, Language Flexible, Primarily In-Memory
▸ Traditional GPU Processing:
  HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
  5-10x Improvement, More Code, Language Rigid, Substantially on GPU

8
Data Processing Evolution
Faster Data Access, Less Data Movement
▸ Hadoop Processing, Reading from Disk:
  HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
▸ Spark In-Memory Processing:
  HDFS Read → Query → ETL → ML Train
  25-100x Improvement, Less Code, Language Flexible, Primarily In-Memory
▸ Traditional GPU Processing:
  HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
  5-10x Improvement, More Code, Language Rigid, Substantially on GPU
▸ RAPIDS:
  Arrow Read → Query → ETL → ML Train
  50-100x Improvement, Same Code, Language Flexible, Primarily on GPU

9
GPU-Accelerated ETL
The Average Data Scientist Spends 90+% of Their Time in
ETL as Opposed to Training Models

10
ETL Technology Stack
▸ Python: Dask cuDF, cuDF, Pandas
▸ Cython
▸ cuDF C++
▸ CUDA Libraries: Thrust, Cub, Jitify
▸ CUDA

11
ETL - the Backbone of Data Science
PYTHON LIBRARY
cuDF is…
▸ A Python library for manipulating GPU DataFrames following the Pandas API
▸ Python interface to CUDA C++ library with additional functionality
▸ Creating GPU DataFrames from Numpy arrays, Pandas DataFrames, and PyArrow Tables
▸ JIT compilation of User-Defined Functions (UDFs) using Numba
▸ Most common formats: CSV, Parquet, ORC, JSON, AVRO, HDF5, and more...

12
Benchmarks: Single-GPU Speedup vs. Pandas
cuDF v0.13, Pandas 0.25.3
▸ Running on NVIDIA DGX-1:
  ▸ GPU: NVIDIA Tesla V100 32GB
  ▸ CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
▸ Benchmark Setup:
  ▸ RMM Pool Allocator Enabled
  ▸ DataFrames: 2x int32 key columns, 3x int32 value columns
  ▸ Merge: inner; GroupBy: count, sum, min, max calculated for each value column
[Figure: GPU Speedup Over CPU for Merge, Sort, and GroupBy at 10M and 100M; bar labels: 970, 500, 370, 350, 330, 320]

13
Machine Learning with RAPIDS
More Models More Problems
▸ Analytics: cuDF, cuIO
▸ Machine Learning: cuML
▸ Graph Analytics: cuGraph
▸ Deep Learning: PyTorch, TensorFlow, MxNet
▸ Visualization: cuxfilter, pyViz, plotly
▸ Dask
▸ GPU Memory

14
ML Technology Stack
▸ Python: Dask cuML, Dask cuDF, cuDF, Numpy
▸ Cython
▸ cuML Algorithms
▸ cuML Prims
▸ CUDA Libraries: Thrust, Cub, cuSolver, nvGraph, CUTLASS, cuSparse, cuRand, cuBlas
▸ CUDA

15
RAPIDS Matches Common Python APIs
CPU-based Clustering
from sklearn.datasets import make_moons
import pandas

X, y = make_moons(n_samples=int(1e2),
                  noise=0.05, random_state=0)
X = pandas.DataFrame({'fea%d'%i: X[:, i]
                      for i in range(X.shape[1])})

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps = 0.3, min_samples = 5)
# DBSCAN has no separate predict(); fit_predict() fits and returns the cluster labels
y_hat = dbscan.fit_predict(X)

16
RAPIDS Matches Common Python APIs
GPU-accelerated Clustering
from sklearn.datasets import make_moons
import cudf

X, y = make_moons(n_samples=int(1e2),
                  noise=0.05, random_state=0)
X = cudf.DataFrame({'fea%d'%i: X[:, i]
                    for i in range(X.shape[1])})

from cuml import DBSCAN
dbscan = DBSCAN(eps = 0.3, min_samples = 5)
# DBSCAN has no separate predict(); fit_predict() fits and returns the cluster labels
y_hat = dbscan.fit_predict(X)
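Because the CPU and GPU versions differ only in their imports, a script can pick the backend at run time. The sketch below is one hypothetical way to do that. The helper name `load_dbscan` is not part of RAPIDS or scikit-learn, and the code assumes that any failure while importing `cuml` means no usable GPU stack is present.

```python
import importlib


def load_dbscan():
    """Return (backend, DBSCAN class), preferring cuML over scikit-learn.

    Hypothetical helper for illustration only. Returns (None, None) when
    neither library can be imported.
    """
    candidates = (("gpu", "cuml"), ("cpu", "sklearn.cluster"))
    for backend, module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except Exception:
            # ImportError when the package is missing; treating every other
            # import-time failure as "backend unavailable" is an assumption.
            continue
        return backend, module.DBSCAN
    return None, None


backend, DBSCAN = load_dbscan()
if DBSCAN is not None:
    # Same constructor arguments as in the two slides.
    model = DBSCAN(eps=0.3, min_samples=5)
print("DBSCAN backend:", backend)
```

The rest of the pipeline can then call `model.fit_predict(X)` without knowing which library supplied the class.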

17
Algorithms
GPU-accelerated Scikit-Learn
Key: Preexisting | NEW or enhanced for 0.15
Classification / Regression
▸ Decision Trees / Random Forests
▸ Linear/Lasso/Ridge/ElasticNet Regression
▸ Logistic Regression
▸ K-Nearest Neighbors
▸ Support Vector Machine Classification and Regression
▸ Naive Bayes
Inference
▸ Random Forest / GBDT Inference (FIL)
Preprocessing
▸ Text vectorization (TF-IDF / Count)
▸ Target Encoding
▸ Cross-validation / splitting
Clustering
▸ K-Means
▸ DBSCAN
▸ Spectral Clustering
Decomposition & Dimensionality Reduction
▸ Principal Components
▸ Singular Value Decomposition
▸ UMAP
▸ Spectral Embedding
▸ T-SNE
Time Series
▸ Holt-Winters
▸ Seasonal ARIMA / Auto ARIMA
Cross Validation
▸ Hyper-parameter Tuning
More to come!

18
Benchmarks:
Single-GPU cuML vs Scikit-learn
1x V100 vs. 2x 20 Core CPUs (DGX-1, RAPIDS 0.15)

19
Forest Inference
Taking Models From Training to Production
cuML’s Forest Inference Library accelerates prediction (inference) for random forests and boosted decision trees:
▸ Works with existing saved models (XGBoost, LightGBM, scikit-learn RF, cuML RF)
▸ Lightweight Python API
▸ Single V100 GPU can infer up to 34x faster than XGBoost dual-CPU node
▸ Over 100 million forest inferences/sec on a DGX-1V
[Figure: XGBoost CPU Inference vs. FIL GPU (1000 trees). Time (ms), 0 to 4000, on the Bosch, Airline, Epsilon, and Higgs datasets; series: CPU Time (XGBoost, 40 Cores) and FIL GPU Time (1x V100); speedup labels: 23x, 36x, 34x, 23x]

20
XGBoost + RAPIDS: Better Together
▸ RAPIDS comes paired with XGBoost 1.2 (as of 0.15)
▸ XGBoost now builds on the GoAI interface standards to provide zero-copy data import from cuDF, CuPy, Numba, PyTorch and more
▸ Official Dask API makes it easy to scale to multiple nodes or multiple GPUs
▸ gpu_hist tree builder delivers huge perf gains
▸ Memory usage when importing GPU data decreased by 2/3 or more
▸ New objectives support Learning to Rank on GPU
All RAPIDS changes are integrated upstream and provided to all XGBoost users via PyPI or RAPIDS conda
21
Explore: RAPIDS Getting Started, Code, and Blogs
From intro to in-depth
▸ https://rapids.ai
▸ https://github.com/rapidsai
▸ https://medium.com/rapids-ai

22
RAPIDS Everywhere
The Next Phase of RAPIDS
Exactly as it sounds: our goal is to make RAPIDS as usable and performant as possible wherever data science is done. We will continue to work with more open source projects to further democratize acceleration and efficiency in data science.

24
MLflow
“... an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.”
- mlflow.org
... and it works with RAPIDS, out of the box!

25
Why RAPIDS + MLflow?
▸ RAPIDS: substantial speedups across a wide range of machine learning and ETL tasks, with an SKlearn-compatible API.
▸ MLflow: improved collaboration, experiment tracking, model storage, registration, and deployment.
[Diagram: model lifecycle loop with stages Training, Validate, Good?, Update, and Production / Engineering]

26
HPO Use Case: 100-Job Random Forest Airline Model
Huge speedups translate into >7x TCO reduction
[Figure: Cost and Time (hours) for the GPU and CPU clusters]
Based on sample Random Forest training code from the cloud-ml-examples repository, running on Azure ML: 10 concurrent workers with 100 total runs, 100M rows, 5-fold cross-validation per run.
▸ GPU nodes: 10x Standard_NC6s_v3, 1 V100 16G, 6 vCPUs, 112G memory, Xeon E5-2690 v4 (Broadwell), $3.366/hour
▸ CPU nodes: 10x Standard_DS5_v2, 16 vCPUs, 56G memory, Xeon E5-2673 v3 (Haswell) or v4 (Broadwell), $1.017/hour
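The two hourly prices quoted for this use case determine how large a training speedup is needed before the GPU cluster becomes the cheaper option. The sketch below works through that arithmetic. It uses only the two node prices from the slide; the function name is illustrative, and the model assumes equal node counts (10 each, as stated) and billing purely by wall-clock hours.

```python
GPU_RATE = 3.366  # $/hour per Standard_NC6s_v3 node (from the slide)
CPU_RATE = 1.017  # $/hour per Standard_DS5_v2 node (from the slide)


def required_speedup(tco_reduction, gpu_rate=GPU_RATE, cpu_rate=CPU_RATE):
    """Speedup (CPU hours / GPU hours) needed for a given cost reduction.

    cost_cpu / cost_gpu = (cpu_rate * cpu_hours) / (gpu_rate * gpu_hours),
    so with equal node counts: speedup = tco_reduction * gpu_rate / cpu_rate.
    """
    return tco_reduction * gpu_rate / cpu_rate


break_even = required_speedup(1.0)
for_7x = required_speedup(7.0)
print(f"break-even speedup: {break_even:.2f}x")
print(f"speedup implied by a 7x TCO reduction: {for_7x:.1f}x")
```

Under these assumptions a GPU node costs about 3.3 times as much per hour, so a 7x cost reduction corresponds to a wall-clock speedup of roughly 23x.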

27
Integration and Training:
Nested HPO Experiments
Parent Experiment
Child HPO Runs
Accuracy Metric
Configuration Parameters
Metadata: Tags

28
Component Overview:
Some Terminology
Local File
System
Backend Store
Artifact Store
/tmp/...
/

29
A Quick Example:
Convert an Existing Project
Conversion to RAPIDS and MLflow
Add nesting+HPO and model logging
Add project entry points
Anaconda and Docker training
Deploying a Trained Model

30
Basic Conversion
SKlearn to cuML

Unmodified Training Code
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def train(fpath, max_depth, max_features, n_estimators):
    # load_data is a project helper (not shown) that returns the train/test splits
    X_train, X_test, y_train, y_test = load_data(fpath)
    mod = RandomForestClassifier(
        max_depth=max_depth,
        max_features=max_features,
        n_estimators=n_estimators
    )
    mod.fit(X_train, y_train)
    preds = mod.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    return mod, accuracy

Augmented Training Code
import mlflow
from cuml.ensemble import RandomForestClassifier  # SKlearn to cuML
from cuml.metrics import accuracy_score

def train(fpath, max_depth, max_features, n_estimators):
    X_train, X_test, y_train, y_test = load_data(fpath)
    # Start MLflow 'run'
    with mlflow.start_run(run_name="RAPIDS-MLFlow"):
        # Record parameters
        mlparams = {
            "max_depth": str(max_depth),
            "max_features": str(max_features),
            "n_estimators": str(n_estimators),
        }
        mlflow.log_params(mlparams)
        mod = RandomForestClassifier(
            max_depth=max_depth,
            max_features=max_features,
            n_estimators=n_estimators
        )
        mod.fit(X_train, y_train)
        preds = mod.predict(X_test)
        # Record performance metrics
        accuracy = accuracy_score(y_test, preds)
        mlflow.log_metric("accuracy", accuracy)
    return mod

31
Integration:
Nesting+HPO and Model Logging
Update Nested Training
import mlflow
import mlflow.sklearn
from cuml.ensemble import RandomForestClassifier
from cuml.metrics import accuracy_score
# Import our HPO library
from your_hpo_library import HPO_Runner

# Called by hpo_runner
def hpo_train(params):
    X_train, X_test, y_train, y_test = load_data(params.fpath)
    with mlflow.start_run(run_name=f"Trial {params.trial}",
                          nested=True):
        mod = RandomForestClassifier(
            max_depth=params.max_depth,
            max_features=params.max_features,
            n_estimators=params.n_estimators
        )
        mod.fit(X_train, y_train)
        preds = mod.predict(X_test)
        accuracy = accuracy_score(y_test, preds)
    return mod, accuracy

Add HPO Runner; Log Runs and Best Model
hpo_runner = HPO_Runner(hpo_train)
with mlflow.start_run(run_name="RAPIDS-HPO", nested=True):
    search_space = [
        uniform("max_depth", 5, 20),
        uniform("max_features", 0.1, 1.0),
        uniform("n_estimators", 150, 1000),
    ]
    hpo_results = hpo_runner(fpath, search_space)
    artifact_path = "rapids-mlflow-example"
    # Register Best Result
    with mlflow.start_run(run_name='Final Classifier', nested=True):
        mlflow.sklearn.log_model(hpo_results.best_model,
                                 artifact_path=artifact_path,
                                 registered_model_name="rapids-mlflow-example",
                                 conda_env='conda/conda.yaml')
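`HPO_Runner` and `your_hpo_library` on this slide are placeholders for whichever tuning library is used. To make the control flow concrete, the sketch below runs a minimal pure-Python random search over the same three ranges. Everything in it is hypothetical: `uniform`, `random_search`, and the toy `objective` stand in for a real HPO library and for `hpo_train`, and no RAPIDS or MLflow calls are made.

```python
import random
from collections import namedtuple

Param = namedtuple("Param", ["name", "low", "high", "is_int"])
Trial = namedtuple("Trial", ["index", "params", "score"])


def uniform(name, low, high):
    """Describe one search dimension; integer bounds yield integer samples."""
    is_int = isinstance(low, int) and isinstance(high, int)
    return Param(name, low, high, is_int)


def sample(space, rng):
    """Draw one configuration from the search space."""
    return {
        p.name: rng.randint(p.low, p.high) if p.is_int else rng.uniform(p.low, p.high)
        for p in space
    }


def random_search(objective, space, n_trials=20, seed=0):
    """Evaluate n_trials random configurations and return (best, all_trials)."""
    rng = random.Random(seed)
    trials = []
    for index in range(n_trials):
        params = sample(space, rng)
        trials.append(Trial(index, params, objective(params)))
    best = max(trials, key=lambda t: t.score)
    return best, trials


def objective(params):
    """Toy stand-in for hpo_train: peaks at max_depth=12, max_features=0.5."""
    return 1.0 - abs(params["max_depth"] - 12) / 20 - abs(params["max_features"] - 0.5)


search_space = [
    uniform("max_depth", 5, 20),
    uniform("max_features", 0.1, 1.0),
    uniform("n_estimators", 150, 1000),
]
best, trials = random_search(objective, search_space, n_trials=25)
print(best.index, best.params, round(best.score, 3))
```

In the slide's structure, each call to `objective` would correspond to one nested child run, and `best` to the model logged under 'Final Classifier'.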

32
Packaging Your Environment
Project Environment
./
├── airline_small.parquet
├── envs
│   └── conda.yaml
├── Dockerfile.training
├── MLproject
├── README.md
└── src
    ├── entrypoint.sh
    └── train.py

Conda Path

Conda Export
$ conda env export --name mlflow > envs/conda.yaml

MLProject (Anaconda)
name: rapids-mlflow
conda_env: envs/conda.yaml
entry_points:
  hpo_run:
    parameters:
      fpath: {type: str}
      n_estimators: {type: int, default: 100}
      max_features: {type: float}
      max_depth: {type: int}
    command: "python src/train.py
      --fpath={fpath} --n_estimators={n_estimators}
      --max_features={max_features} --max_depth={max_depth}"

Docker Path

Dockerfile.training
FROM rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8
RUN source activate rapids \
    && pip install mlflow

MLProject (Docker/K8s)
name: rapids-mlflow
docker_env:
  image: mlflow-rapids-example
entry_points:
  hpo_run:
    parameters:
      fpath: {type: str}
      n_estimators: {type: int, default: 100}
      max_features: {type: float}
      max_depth: {type: int}
    command: "/bin/bash src/entrypoint.sh src/train.py
      --fpath={fpath} --n_estimators={n_estimators}
      --max_features={max_features} --max_depth={max_depth}"
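The `command` strings in both MLproject files pass each entry-point parameter to `src/train.py` as a `--name=value` flag, so the script needs a matching command-line parser. The slides do not show that part of `train.py`. The following is a sketch of what it might look like using `argparse`, with the types and the `n_estimators` default copied from the MLproject file; the function name and help text are assumptions.

```python
import argparse


def build_parser():
    """Parser mirroring the hpo_run entry-point parameters (hypothetical)."""
    parser = argparse.ArgumentParser(description="RAPIDS + MLflow training entry point")
    parser.add_argument("--fpath", type=str, required=True, help="path to the training data")
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_features", type=float, required=True)
    parser.add_argument("--max_depth", type=int, required=True)
    return parser


# Example: an argument list in the shape produced by the command template.
args = build_parser().parse_args(
    ["--fpath=airline_small.parquet", "--max_features=0.5", "--max_depth=10"]
)
print(args.fpath, args.n_estimators, args.max_features, args.max_depth)
```

Parameters without a default in MLproject are marked `required=True` here so that a missing value fails at start-up rather than midway through training.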

33
Bringing Things Together
Anaconda
## New conda environment
$ conda create --name mlflow python=3.8
....
$ conda activate mlflow
## Install mlflow libs/tools -- this gives us the mlflow util
$ pip install mlflow
## Create a training run with 'mlflow run'
$ export MLFLOW_TRACKING_URI=sqlite:////tmp/mlflow-db.sqlite
## Train in a custom Conda Environment
$ mlflow run --experiment-name "RAPIDS-MLflow-Conda" \
    --entry-point hpo_run ./
....
Created version '10' of model 'rapids_mlflow_cli'.
Model uri: ./mlruns/3/c20642df4137490fba2ca96a7b4431b0/artifacts/Airline-Demo
2020/09/29 23:36:37 INFO mlflow.projects: === Run (ID 'c20642df4137490fba2ca96a7b4431b0') succeeded ==

Docker
## New conda environment
$ conda create --name mlflow python=3.8
....
$ conda activate mlflow
## Install mlflow libs/tools -- this gives us the mlflow util
$ pip install mlflow
## Build the training image referenced by docker_env in MLproject
$ docker build --tag mlflow-rapids-example --file \
    ./Dockerfile.training ./
....
## Create a training run with 'mlflow run'
$ export MLFLOW_TRACKING_URI=sqlite:////tmp/mlflow-db.sqlite
$ mlflow run --experiment-name "RAPIDS-MLflow-Docker" \
    --entry-point hpo_run ./

Nvidia-Docker
$ vi /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": { .... }
    }
}

34
Nested HPO Experiments
Parent Experiment
Child HPO Runs
Accuracy Metric
Configuration Parameters
Metadata: Tags

35
Model Deployment
Anaconda
$ mlflow models serve -m models:/rapids_mlflow_cli/1 -p 56767
2020/09/24 18:05:26 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2020/09/24 18:05:26 INFO mlflow.pyfunc.backend: === Running command 'gunicorn --timeout=60 -b 127.0.0.1:56767 -w 1 ${GUNICORN_CMD_ARGS} -- mlflow.pyfunc.scoring_server.wsgi:app'
[2020-09-24 18:05:26 -0600] [17024] [INFO] Starting gunicorn 20.0.4
[2020-09-24 18:05:26 -0600] [17024] [INFO] Listening at: http://127.0.0.1:56767 (17024)
[2020-09-24 18:05:26 -0600] [17024] [INFO] Using worker: sync
[2020-09-24 18:05:26 -0600] [17026] [INFO] Booting worker with pid: 17026
[2020-09-24 18:05:28 -0600] [17024] [INFO] Handling signal: winch
Callouts: Registered Model (the -m argument; this can also be a storage path), Query Request

Docker Serving (Experimental)
$ mlflow models build-docker -m models:/rapids_mlflow_cli/9 -n mlflow-rapids-example
2020/09/24 16:43:18 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2020/09/24 16:43:18 INFO mlflow.models.docker_utils: Building docker image with name mlflow-rapids-example
…. build process ….
Successfully built 900f8e84b370
Successfully tagged mlflow-rapids-example:latest
$
Callout: Registered Model (the -m argument)

36
Endpoint Inference
test_query.py
import json
import requests

host = 'localhost'
port = '56767'

headers = {
    # MLflow 1.x scoring servers read the payload layout from the content type
    "Content-Type": "application/json; format=pandas-split",
}

data = {
    "columns": ["Year", "Month", "DayofMonth", "DayofWeek", "CRSDepTime",
                "CRSArrTime", "UniqueCarrier",
                "FlightNum", "ActualElapsedTime", "Origin", "Dest", "Distance",
                "Diverted"],
    "data": [[1987, 10, 1, 4, 1, 556, 0, 190, 247, 202, 162, 1846, 0]]
}

resp = requests.post(url="http://%s:%s/invocations" % (host, port),
                     data=json.dumps(data), headers=headers)
print('Classification: %s' % ("ON-Time" if resp.text == "[0.0]" else "LATE"))

Shell
$ python src/rf_test/test_query.py
Classification: ON-Time
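The request body in `test_query.py` uses the pandas 'split' layout: one list of column names and one list of rows. A mismatch between the two is an easy mistake when building queries by hand. The helper below is a hypothetical, standard-library-only sketch that checks row widths before serializing. Its `classify` function applies the same on-time rule as the script, but parses the response as JSON instead of comparing raw text; both function names are assumptions.

```python
import json


def build_split_payload(columns, rows):
    """Serialize rows in pandas 'split' orientation, validating row widths."""
    for i, row in enumerate(rows):
        if len(row) != len(columns):
            raise ValueError(f"row {i} has {len(row)} values, expected {len(columns)}")
    return json.dumps({"columns": list(columns), "data": [list(r) for r in rows]})


def classify(response_text):
    """A prediction of [0.0] means on time, anything else means late."""
    return "ON-Time" if json.loads(response_text) == [0.0] else "LATE"


columns = ["Year", "Month", "DayofMonth", "DayofWeek", "CRSDepTime",
           "CRSArrTime", "UniqueCarrier", "FlightNum", "ActualElapsedTime",
           "Origin", "Dest", "Distance", "Diverted"]
row = [1987, 10, 1, 4, 1, 556, 0, 190, 247, 202, 162, 1846, 0]
payload = build_split_payload(columns, [row])
print(classify("[0.0]"), classify("[1.0]"))
```

The resulting `payload` string can be passed as the `data` argument of `requests.post` in place of `json.dumps(data)`.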

37
RAPIDS Cloud-ML Examples
https://github.com/rapidsai/cloud-ml-examples
RAPIDS + MLflow All-In-One Deployments
(coming soon!)
RAPIDS Cloud Notebooks
Amazon AWS, Databricks, Microsoft Azure, Google GCP
RAPIDS Platform Integration
SageMaker, AzureML, Google AI Platform
RAPIDS Framework Integration
Dask, MLflow, Optuna, Ray Tune

Thank you!
Find us on Twitter: @rapidsai
