John Zedlewski, Devin Robison
Accelerating Machine Learning
with RAPIDS and MLflow
2
Outline
● RAPIDS for accelerated data science
● Why RAPIDS + MLflow?
● Example Integration
● Training and Deployment as an
MLproject
3
What is RAPIDS?
4
Pandas
Analytics
CPU Memory
Data Preparation, Model Training, Visualization
Scikit-Learn
Machine Learning
NetworkX
Graph Analytics
PyTorch,
TensorFlow, MxNet
Deep Learning
Matplotlib
Visualization
Dask
Open Standards Data Science Ecosystem
Traditional Python APIs on CPU
5
cuDF cuIO
Analytics
Data Preparation, Model Training, Visualization
cuML, XGBoost
Machine Learning
cuGraph
Graph Analytics
PyTorch,
TensorFlow, MxNet
Deep Learning
cuxfilter, pyViz,
plotly
Visualization
Dask
GPU Memory
RAPIDS
End-to-End GPU Accelerated Data Science
6
Dask
GPU Memory
Data Preparation, Model Training, Visualization
cuML
Machine Learning
cuGraph
Graph Analytics
PyTorch,
TensorFlow, MxNet
Deep Learning
cuxfilter, pyViz,
plotly
Visualization
RAPIDS ETL
GPU Accelerated Data Wrangling and Feature Engineering
cuDF cuIO
Analytics
7
Data Processing Evolution
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from Disk
HDFS Read, Query, HDFS Write, HDFS Read, ETL, HDFS Write, HDFS Read, ML Train
Spark In-Memory Processing
HDFS Read, Query, ETL, ML Train
25-100x Improvement, Less Code, Language Flexible, Primarily In-Memory
Traditional GPU Processing
HDFS Read, GPU Read, Query, CPU Write, GPU Read, ETL, CPU Write, GPU Read, ML Train
5-10x Improvement, More Code, Language Rigid, Substantially on GPU
8
Data Processing Evolution
Faster Data Access, Less Data Movement
RAPIDS
Arrow Read, Query, ETL, ML Train
50-100x Improvement, Same Code, Language Flexible, Primarily on GPU
9
GPU-Accelerated ETL
The Average Data Scientist Spends 90+% of Their Time in
ETL as Opposed to Training Models
10
Dask cuDF
cuDF
Pandas
Thrust
Cub
Jitify
Python
Cython
cuDF C++
CUDA Libraries
CUDA
ETL Technology Stack
11
ETL - the Backbone of Data Science
PYTHON LIBRARY
▸ A Python library for manipulating GPU
DataFrames following the Pandas API
▸ Python interface to CUDA C++ library with
additional functionality
▸ Creating GPU DataFrames from Numpy arrays,
Pandas DataFrames, and PyArrow Tables
▸ JIT compilation of User-Defined Functions
(UDFs) using Numba
▸ Most common formats: CSV, Parquet, ORC,
JSON, AVRO, HDF5, and more...
cuDF is…
12
Benchmarks: Single-GPU Speedup vs. Pandas
cuDF v0.13, Pandas 0.25.3
▸ Running on NVIDIA DGX-1:
▸ GPU: NVIDIA Tesla V100 32GB
▸ CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
▸ Benchmark Setup:
▸ RMM Pool Allocator Enabled
▸ DataFrames: 2x int32 key columns, 3x int32 value columns
▸ Merge: inner; GroupBy: count, sum, min, max
calculated for each value column
[Chart: GPU speedup over CPU for Merge, Sort, and GroupBy at 10M and 100M rows; bar labels: 970, 500, 370, 350, 330, 320]
13
PyTorch,
TensorFlow, MxNet
Deep Learning
Dask
cuDF cuIO
Analytics
GPU Memory
Data Preparation, Model Training, Visualization
cuGraph
Graph Analytics
cuxfilter, pyViz,
plotly
Visualization
Machine Learning with RAPIDS
More Models More Problems
cuML
Machine Learning
14
Dask cuML
Dask cuDF
cuDF
Numpy
Python
Thrust
Cub
cuSolver
nvGraph
CUTLASS
cuSparse
cuRand
cuBlas
Cython
cuML Algorithms
cuML Prims
CUDA Libraries
CUDA
ML Technology Stack
15
from sklearn.datasets import make_moons
import pandas

X, y = make_moons(n_samples=int(1e2),
                  noise=0.05, random_state=0)
X = pandas.DataFrame({'fea%d'%i: X[:, i]
                      for i in range(X.shape[1])})

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
# DBSCAN has no predict(); fit_predict() fits the model and returns the cluster labels
y_hat = dbscan.fit_predict(X)
RAPIDS Matches Common Python APIs
CPU-Based Clustering
16
from sklearn.datasets import make_moons
import cudf

X, y = make_moons(n_samples=int(1e2),
                  noise=0.05, random_state=0)
X = cudf.DataFrame({'fea%d'%i: X[:, i]
                    for i in range(X.shape[1])})

from cuml import DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
# DBSCAN has no predict(); fit_predict() fits the model and returns the cluster labels
y_hat = dbscan.fit_predict(X)
RAPIDS Matches Common Python APIs
GPU-accelerated Clustering
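Since the two snippets differ only in their imports, one way to keep a single code path is to choose the backend at import time. The sketch below is an illustration, not something shown in the talk: the helper name `first_available` is hypothetical, and it assumes that falling back to scikit-learn is acceptable when cuML cannot be imported.

```python
import importlib
import importlib.util


def first_available(candidates):
    """Return the first module in `candidates` that can be imported, or None."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return importlib.import_module(name)
        except Exception:
            # Missing parent package, or an import that fails (e.g. no usable GPU).
            continue
    return None


# Prefer the GPU implementation when RAPIDS is installed, else fall back to CPU.
cluster_backend = first_available(["cuml", "sklearn.cluster"])
if cluster_backend is not None:
    print("Clustering backend:", cluster_backend.__name__)
else:
    print("Neither cuml nor scikit-learn is installed")
```

Both `cuml` and `sklearn.cluster` expose a `DBSCAN` class, so code written against `cluster_backend.DBSCAN` would run on whichever backend was found.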
17
GPU-accelerated Scikit-Learn
Algorithms
Classification / Regression: Decision Trees / Random Forests; Linear/Lasso/Ridge/ElasticNet Regression; Logistic Regression; K-Nearest Neighbors; Support Vector Machine Classification and Regression; Naive Bayes
Inference: Random Forest / GBDT Inference (FIL)
Preprocessing: Text vectorization (TF-IDF / Count); Target Encoding; Cross-validation / splitting
Clustering: K-Means; DBSCAN; Spectral Clustering
Decomposition & Dimensionality Reduction: Principal Components; Singular Value Decomposition; UMAP; Spectral Embedding; T-SNE
Time Series: Holt-Winters; Seasonal ARIMA / Auto ARIMA
Cross Validation
Hyper-parameter Tuning
More to come!
Key: Preexisting | NEW or enhanced for 0.15
18
Benchmarks:
Single-GPU cuML vs Scikit-learn
1x V100 vs. 2x 20 Core CPUs (DGX-1, RAPIDS 0.15)
19
Forest Inference
cuML’s Forest Inference Library accelerates prediction
(inference) for random forests and boosted decision trees:
▸ Works with existing saved models
(XGBoost, LightGBM, scikit-learn RF, cuML RF)
▸ Lightweight Python API
▸ Single V100 GPU can infer up to 34x faster than
XGBoost dual-CPU node
▸ Over 100 million forest inferences/sec on a DGX-1V
Taking Models From Training to Production
[Chart: XGBoost CPU Inference vs. FIL GPU (1000 trees). Time (ms) on the Bosch, Airline, Epsilon, and Higgs datasets; series: CPU Time (XGBoost, 40 Cores) and FIL GPU Time (1x V100); speedup labels: 23x, 36x, 34x, 23x]
20
XGBoost + RAPIDS: Better Together
▸ RAPIDS comes paired with XGBoost 1.2 (as of
0.15)
▸ XGBoost now builds on the GoAI interface
standards to provide zero-copy data import from
cuDF, CuPy, Numba, PyTorch and more
▸ Official Dask API makes it easy to scale to
multiple nodes or multiple GPUs
▸ gpu_hist tree builder delivers huge perf gains
▸ Memory usage when importing GPU data
decreased by 2/3 or more
▸ New objectives support Learning to Rank on GPU
All RAPIDS changes are integrated upstream and provided to
all XGBoost users, via PyPI or RAPIDS conda
21
https://github.com/rapidsai https://medium.com/rapids-ai
Explore: RAPIDS Getting Started, Code, and Blogs
From intro to in-depth
https://rapids.ai
22
Exactly as it sounds—our goal is to make
RAPIDS as usable and performant as
possible wherever data science is done.
We will continue to work with more open
source projects to further democratize
acceleration and efficiency in data science.
RAPIDS Everywhere
The Next Phase of RAPIDS
23
MLflow + RAPIDS
24
“... an open source platform to manage the ML lifecycle,
including experimentation, reproducibility, deployment,
and a central model registry.”
- mlflow.org
... and it works with RAPIDS out of the box!
MLflow
25
Why RAPIDS + MLflow?
RAPIDS: substantial speedups across a wide range of machine learning and ETL tasks, with a scikit-learn-compatible API.
MLflow: improved collaboration, experiment tracking, model storage, registration, and deployment.
[Diagram: workflow loop with Training, Validate, Good?, Update, and Production / Engineering]
26
HPO Use Case: 100-Job Random Forest Airline Model
Huge speedups translate into >7x TCO reduction
Based on sample Random Forest training code from the cloud-ml-examples repository, running on Azure ML: 10 concurrent workers with 100 total runs, 100M rows, 5-fold cross-validation per run.
GPU nodes: 10x Standard_NC6s_v3, 1 V100 16G, 6 vCPUs, 112G memory, Xeon E5-2690 v4 (Broadwell), $3.366/hour
CPU nodes: 10x Standard_DS5_v2, 16 vCPUs, 56G memory, Xeon E5-2673 v3 (Haswell) or v4 (Broadwell), $1.017/hour
[Chart: Cost and Time (hours)]
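The relationship between speedup and cost can be checked from the hourly prices quoted above. The following is a rough sketch under stated assumptions: it counts only per-hour compute charges for the 10 nodes in each cluster, which is narrower than a full TCO calculation, and the function and variable names are illustrative.

```python
# Hourly prices quoted on the slide (USD per node-hour); both clusters use 10 nodes.
GPU_PRICE = 3.366   # Standard_NC6s_v3
CPU_PRICE = 1.017   # Standard_DS5_v2
NODES = 10


def cost_ratio(speedup):
    """CPU cost divided by GPU cost when the GPU cluster finishes `speedup` times faster."""
    gpu_hours = 1.0
    cpu_hours = gpu_hours * speedup          # the CPU cluster runs `speedup` times longer
    cpu_cost = NODES * CPU_PRICE * cpu_hours
    gpu_cost = NODES * GPU_PRICE * gpu_hours
    return cpu_cost / gpu_cost


break_even = GPU_PRICE / CPU_PRICE   # speedup at which both clusters cost the same
needed_for_7x = 7 * break_even       # speedup implied by a 7x compute-cost reduction

print(f"break-even speedup: {break_even:.2f}x")
print(f"speedup needed for 7x lower cost: {needed_for_7x:.1f}x")
```

Under these assumptions the GPU nodes cost a little over three times as much per hour, so any end-to-end speedup beyond that ratio lowers the compute bill.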
27
Integration and Training:
Nested HPO Experiments
[Screenshot callouts: Parent Experiment, Child HPO Runs, Accuracy Metric, Configuration Parameters, Metadata: Tags]
28
Component Overview:
Some Terminology
[Diagram: Local File System (/tmp/..., /), Backend Store, Artifact Store]
29
A Quick Example:
Convert an Existing Project
Conversion to RAPIDS and MLflow
Add nesting+HPO and model logging
Add project entry points
Anaconda and Docker training
Deploying a Trained Model
30
Integration and Training:
Basic Conversion
Unmodified Training Code

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# load_data is a project helper defined elsewhere
def train(fpath, max_depth, max_features, n_estimators):
    X_train, X_test, y_train, y_test = load_data(fpath)
    mod = RandomForestClassifier(
        max_depth=max_depth,
        max_features=max_features,
        n_estimators=n_estimators
    )
    mod.fit(X_train, y_train)
    preds = mod.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    return mod, accuracy

Augmented Training Code

import mlflow
# SKlearn to cuML
from cuml.ensemble import RandomForestClassifier
from cuml.metrics import accuracy_score

def train(fpath, max_depth, max_features, n_estimators):
    X_train, X_test, y_train, y_test = load_data(fpath)
    # Start MLflow 'run'
    with mlflow.start_run(run_name="RAPIDS-MLFlow"):
        # Record Parameters
        mlparams = {
            "max_depth": str(max_depth),
            "max_features": str(max_features),
            "n_estimators": str(n_estimators),
        }
        mlflow.log_params(mlparams)
        mod = RandomForestClassifier(
            max_depth=max_depth,
            max_features=max_features,
            n_estimators=n_estimators
        )
        mod.fit(X_train, y_train)
        preds = mod.predict(X_test)
        accuracy = accuracy_score(y_test, preds)
        # Record Performance Metrics
        mlflow.log_metric("accuracy", accuracy)
    return mod
31
Integration:
Nesting+HPO and Model Logging
Import our HPO library, Update Nested Training

import mlflow
import mlflow.sklearn
from cuml.ensemble import RandomForestClassifier
from cuml.metrics import accuracy_score
# your_hpo_library is a placeholder for whichever HPO library you use
from your_hpo_library import HPO_Runner, uniform

# Called by hpo_runner
def hpo_train(params):
    X_train, X_test, y_train, y_test = load_data(params.fpath)
    # Each trial is logged as a nested child run
    with mlflow.start_run(run_name=f"Trial {params.trial}",
                          nested=True):
        mod = RandomForestClassifier(
            max_depth=params.max_depth,
            max_features=params.max_features,
            n_estimators=params.n_estimators
        )
        mod.fit(X_train, y_train)
        preds = mod.predict(X_test)
        accuracy = accuracy_score(y_test, preds)
    return mod, accuracy

Add HPO Runner, Log Runs and Best Model, Register Best Result

hpo_runner = HPO_Runner(hpo_train)
with mlflow.start_run(run_name="RAPIDS-HPO", nested=True):
    search_space = [
        uniform("max_depth", 5, 20),
        uniform("max_features", 0.1, 1.0),
        uniform("n_estimators", 150, 1000),
    ]
    hpo_results = hpo_runner(fpath, search_space)
    artifact_path = "rapids-mlflow-example"
    with mlflow.start_run(run_name='Final Classifier', nested=True):
        mlflow.sklearn.log_model(hpo_results.best_model,
                                 artifact_path=artifact_path,
                                 registered_model_name="rapids-mlflow-example",
                                 conda_env='conda/conda.yaml')
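The slide leaves the HPO library unspecified. As a rough illustration of the contract it implies (a runner that calls the training function once per trial and keeps the best model), here is a minimal random-search stand-in. Everything in it is an assumption: the `HPORunner` class, the `uniform` helper, the result fields, and the `fake_train` objective are hypothetical and not part of MLflow, RAPIDS, or any real HPO package.

```python
import random
from types import SimpleNamespace


def uniform(name, low, high):
    """Describe one search dimension (hypothetical helper)."""
    return (name, low, high)


class HPORunner:
    """Toy random-search runner: calls `train_fn` once per sampled trial."""

    def __init__(self, train_fn, n_trials=10, seed=0):
        self.train_fn = train_fn
        self.n_trials = n_trials
        self.rng = random.Random(seed)

    def __call__(self, fpath, search_space):
        best = SimpleNamespace(best_model=None, best_accuracy=float("-inf"), trials=[])
        for trial in range(self.n_trials):
            # Draw one value per dimension, uniformly within its bounds.
            sampled = {name: self.rng.uniform(low, high) for name, low, high in search_space}
            params = SimpleNamespace(fpath=fpath, trial=trial, **sampled)
            model, accuracy = self.train_fn(params)
            best.trials.append((trial, accuracy))
            if accuracy > best.best_accuracy:
                best.best_model, best.best_accuracy = model, accuracy
        return best


def fake_train(params):
    """Stand-in for hpo_train: scores a trial without touching a GPU."""
    accuracy = 1.0 - abs(params.max_features - 0.5)  # peaks at max_features = 0.5
    return {"max_features": params.max_features}, accuracy


results = HPORunner(fake_train, n_trials=20)("airline_small.parquet", [
    uniform("max_depth", 5, 20),
    uniform("max_features", 0.1, 1.0),
    uniform("n_estimators", 150, 1000),
])
print(f"best accuracy over {len(results.trials)} trials: {results.best_accuracy:.3f}")
```

A real runner would also cast integer-valued parameters such as `max_depth` and `n_estimators` before passing them to the estimator; this sketch samples every dimension as a float.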
32
Integration and Training:
Packaging Your Environment
./
├── airline_small.parquet
├── envs
│   └── conda.yaml
├── Dockerfile.training
├── MLproject
├── README.md
└── src
    ├── entrypoint.sh
    └── train.py
Project Environment
MLproject (Anaconda)
name: rapids-mlflow
conda_env: envs/conda.yaml
entry_points:
  hpo_run:
    parameters:
      fpath: {type: str}
      n_estimators: {type: int, default: 100}
      max_features: {type: float}
      max_depth: {type: int}
    command: "python src/train.py
              --fpath={fpath} --n_estimators={n_estimators}
              --max_features={max_features} --max_depth={max_depth}"

MLproject (Docker/K8s)
name: rapids-mlflow
docker_env:
  image: mlflow-rapids-example
entry_points:
  hpo_run:
    parameters:
      fpath: {type: str}
      n_estimators: {type: int, default: 100}
      max_features: {type: float}
      max_depth: {type: int}
    command: "/bin/bash src/entrypoint.sh src/train.py
              --fpath={fpath} --n_estimators={n_estimators}
              --max_features={max_features} --max_depth={max_depth}"

Conda Path: Conda Export
$ conda env export --name mlflow > envs/conda.yaml

Docker Path: Dockerfile.training
FROM rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8
RUN source activate rapids \
    && pip install mlflow
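The `hpo_run` command passes four flags to `src/train.py`. The talk does not show how the script reads them, so the following is only a sketch of one plausible argument parser, assuming the standard-library `argparse` module and the same flag names and default as the entry point definition.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the hpo_run entry point passes to src/train.py."""
    parser = argparse.ArgumentParser(description="RAPIDS + MLflow training entry point")
    parser.add_argument("--fpath", type=str, required=True, help="Path to the training data")
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_features", type=float, required=True)
    parser.add_argument("--max_depth", type=int, required=True)
    return parser.parse_args(argv)


# The same flag layout that the MLproject command template expands to.
args = parse_args([
    "--fpath=airline_small.parquet",
    "--n_estimators=100",
    "--max_features=0.5",
    "--max_depth=12",
])
print(args)
```

Mirroring the `default: 100` for `n_estimators` in the parser keeps the script usable when it is run directly, outside `mlflow run`.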
33
Integration and Training:
Bringing Things Together
Anaconda
## New conda environment
$ conda create --name mlflow python=3.8
....
$ conda activate mlflow
## Install mlflow libs/tools -- this gives us the mlflow util
$ pip install mlflow
## Create a training run with 'mlflow run'
$ export MLFLOW_TRACKING_URI=sqlite:////tmp/mlflow-db.sqlite
## Train in a custom Conda Environment
$ mlflow run --experiment-name "RAPIDS-MLflow-Conda" \
    --entry-point hpo_run ./
....
Created version '10' of model 'rapids_mlflow_cli'.
Model uri: ./mlruns/3/c20642df4137490fba2ca96a7b4431b0/artifacts/Airline-Demo
2020/09/29 23:36:37 INFO mlflow.projects: === Run (ID 'c20642df4137490fba2ca96a7b4431b0') succeeded ===

Docker
## New conda environment
$ conda create --name mlflow python=3.8
....
$ conda activate mlflow
## Install mlflow libs/tools -- this gives us the mlflow util
$ pip install mlflow
## Build the training image referenced by docker_env in MLproject
$ docker build --tag mlflow-rapids-example --file \
    ./Dockerfile.training ./
....
## Create a training run with 'mlflow run'
$ export MLFLOW_TRACKING_URI=sqlite:////tmp/mlflow-db.sqlite
$ mlflow run --experiment-name "RAPIDS-MLflow-Docker" \
    --entry-point hpo_run ./

Nvidia-Docker
$ vi /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": { .... }
  }
}
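The four slashes in `sqlite:////tmp/mlflow-db.sqlite` are easy to misread. In SQLAlchemy-style URLs the prefix is `sqlite:///`, and whatever follows is the file path, so a fourth slash marks an absolute path. The helper below is a hypothetical illustration of that rule using plain string handling; it is not how MLflow itself parses the URI.

```python
SQLITE_PREFIX = "sqlite:///"


def sqlite_db_path(tracking_uri):
    """Return the database file path encoded in a sqlite tracking URI."""
    if not tracking_uri.startswith(SQLITE_PREFIX):
        raise ValueError(f"not a sqlite URI: {tracking_uri!r}")
    return tracking_uri[len(SQLITE_PREFIX):]


path = sqlite_db_path("sqlite:////tmp/mlflow-db.sqlite")
kind = "absolute" if path.startswith("/") else "relative"
print(f"{path} ({kind})")
```

With only three slashes the path would be relative to the working directory, so runs started from different directories would write to different databases.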
34
Integration and Training:
Nested HPO Experiments
[Screenshot callouts: Parent Experiment, Child HPO Runs, Accuracy Metric, Configuration Parameters, Metadata: Tags]
35
Model Deployment
Anaconda
$ mlflow models serve -m models:/rapids_mlflow_cli/1 -p 56767
2020/09/24 18:05:26 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2020/09/24 18:05:26 INFO mlflow.pyfunc.backend: === Running command 'gunicorn --timeout=60 -b 127.0.0.1:56767 -w 1 ${GUNICORN_CMD_ARGS} -- mlflow.pyfunc.scoring_server.wsgi:app'
[2020-09-24 18:05:26 -0600] [17024] [INFO] Starting gunicorn 20.0.4
[2020-09-24 18:05:26 -0600] [17024] [INFO] Listening at: http://127.0.0.1:56767 (17024)
[2020-09-24 18:05:26 -0600] [17024] [INFO] Using worker: sync
[2020-09-24 18:05:26 -0600] [17026] [INFO] Booting worker with pid: 17026
[2020-09-24 18:05:28 -0600] [17024] [INFO] Handling signal: winch
Registered Model: models:/rapids_mlflow_cli/1. This can also be a storage path.
Query Request

Docker Serving (Experimental)
$ mlflow models build-docker -m models:/rapids_mlflow_cli/9 -n mlflow-rapids-example
2020/09/24 16:43:18 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2020/09/24 16:43:18 INFO mlflow.models.docker_utils: Building docker image with name mlflow-rapids-example
.... build process ....
Successfully built 900f8e84b370
Successfully tagged mlflow-rapids-example:latest
$
Registered Model: models:/rapids_mlflow_cli/9
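Both commands address the model with a registry URI of the form `models:/<name>/<version>`. As a small illustration of how that string breaks down, the sketch below splits it into its parts; the function name is hypothetical and this is not MLflow's own parser.

```python
def parse_models_uri(uri):
    """Split 'models:/<name>/<version>' into (name, version)."""
    prefix = "models:/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a model registry URI: {uri!r}")
    name, sep, version = uri[len(prefix):].partition("/")
    if not name or not sep or not version:
        raise ValueError(f"expected models:/<name>/<version>, got {uri!r}")
    return name, version


for uri in ("models:/rapids_mlflow_cli/1", "models:/rapids_mlflow_cli/9"):
    name, version = parse_models_uri(uri)
    print(f"{uri} -> model {name!r}, version {version}")
```

The two commands on the slide point at different versions (1 and 9) of the same registered model, `rapids_mlflow_cli`.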
36
Endpoint Inference
test_query.py

import json
import requests

host = 'localhost'
port = '56767'

headers = {
    "Content-Type": "application/json",
    "format": "pandas-split"
}

data = {
    "columns": ["Year", "Month", "DayofMonth", "DayofWeek", "CRSDepTime",
                "CRSArrTime", "UniqueCarrier",
                "FlightNum", "ActualElapsedTime", "Origin", "Dest", "Distance",
                "Diverted"],
    "data": [[1987, 10, 1, 4, 1, 556, 0, 190, 247, 202, 162, 1846, 0]]
}

resp = requests.post(url="http://%s:%s/invocations" % (host, port),
                     data=json.dumps(data), headers=headers)
print('Classification: %s' % ("ON-Time" if resp.text == "[0.0]" else "LATE"))

Shell

$ python src/rf_test/test_query.py
Classification: ON-Time
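The request body in test_query.py pairs a list of column names with a list of rows. The sketch below factors that into two helpers that can be exercised without a running server. The function names are hypothetical, and the response handling assumes the reply is a JSON list of numbers, as in the `[0.0]` example above.

```python
import json


def build_split_payload(columns, rows):
    """Serialize rows in the columns/data layout used by test_query.py."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} values, got {len(row)}")
    return json.dumps({"columns": list(columns), "data": [list(r) for r in rows]})


def classify(response_text):
    """Map the server's JSON reply to the labels printed by test_query.py."""
    predictions = json.loads(response_text)
    return ["ON-Time" if p == 0 else "LATE" for p in predictions]


columns = ["Year", "Month", "DayofMonth", "DayofWeek", "CRSDepTime",
           "CRSArrTime", "UniqueCarrier", "FlightNum", "ActualElapsedTime",
           "Origin", "Dest", "Distance", "Diverted"]
body = build_split_payload(columns, [[1987, 10, 1, 4, 1, 556, 0, 190, 247, 202, 162, 1846, 0]])
print(body)
print(classify("[0.0]"))
```

Parsing the reply as JSON, rather than comparing the raw text to `"[0.0]"`, also handles batches with more than one row.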
37
RAPIDS Cloud-ML Examples
https://github.com/rapidsai/cloud-ml-examples
RAPIDS + MLflow All-In-One Deployments
(coming soon!)
RAPIDS Cloud Notebooks
Amazon AWS, Databricks, Microsoft Azure, Google GCP
RAPIDS Platform Integration
SageMaker, AzureML, Google AI Platform
RAPIDS Framework Integration
DASK, MLflow, Optuna, RayTune
Thank you!
Find us on Twitter: @rapidsai

Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
